Grokbase Groups Pig dev February 2008
Don't copy to DFS if source filesystem marked as shared
------------------------------------------------------

Key: PIG-102
URL: https://issues.apache.org/jira/browse/PIG-102
Project: Pig
Issue Type: New Feature
Components: impl
Environment: Installations with shared folders on all nodes (eg NFS)
Reporter: Craig Macdonald


I've been playing with Pig using three setups:
(a) local
(b) hadoop mapred with hdfs
(c) hadoop mapred with file:///path/to/shared/fs as the default file system

In our local setup, various NFS filesystems are shared between all machines (including mapred nodes), e.g. /users and /local.

I would like Pig to note when input files are in a file:// directory that has been marked as shared, and hence not copy them to DFS.

Similarly, the Torque PBS resource manager has a usecp directive, which notes when a filesystem location is shared between all nodes (and hence scp is not needed; cp alone can be used). See http://www.clusterresources.com/wiki/doku.php?id=torque:6.2_nfs_and_other_networked_filesystems

It would be good to have a configurable setting in Pig that says when a filesystem is shared, and hence no copying between file:// and hdfs:// is needed.
An example in our setup might be:
sharedFS file:///local/
sharedFS file:///users/
(I'm not sure whether these should be config entries or commands.)

This command should be used with care. Obviously if you have 1000 nodes all accessing a shared file in NFS, then it would have been better to "hadoopify" the file.

The likely area of code to patch is src/org/apache/pig/impl/io/FileLocalizer.java, in hadoopify(String, PigContext).
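As a rough illustration of the proposed check (all names here are hypothetical; plain Java with no Hadoop dependency), hadoopify could consult the configured shared roots before copying:

```java
import java.net.URI;
import java.util.List;

// Hypothetical sketch of the proposed sharedFS check: before copying a
// file: input to DFS, test whether its path falls under a configured
// shared root (e.g. "sharedFS file:///local/"). Names are illustrative.
public class SharedFsCheck {
    private final List<URI> sharedRoots;

    public SharedFsCheck(List<URI> sharedRoots) {
        this.sharedRoots = sharedRoots;
    }

    /** True if the file: URI lies under a shared root, so no copy is needed. */
    public boolean isShared(URI path) {
        if (!"file".equals(path.getScheme())) {
            return false; // only local paths can be NFS-shared
        }
        for (URI root : sharedRoots) {
            if (path.getPath().startsWith(root.getPath())) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        SharedFsCheck check = new SharedFsCheck(List.of(
                URI.create("file:///local/"), URI.create("file:///users/")));
        System.out.println(check.isShared(URI.create("file:///users/craig/data.txt"))); // true
        System.out.println(check.isShared(URI.create("file:///tmp/scratch.txt")));      // false
    }
}
```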

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


  • Pi Song (JIRA) at Mar 7, 2008 at 2:01 pm
    [ https://issues.apache.org/jira/browse/PIG-102?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12576200#action_12576200 ]

    Pi Song commented on PIG-102:
    -----------------------------

I only know that Hadoop MapReduce local mode can run on the local FS.
Can Hadoop MapReduce in distributed mode run without HDFS? Someone please explain this.

  • Benjamin Reed (JIRA) at Mar 7, 2008 at 3:15 pm
    [ https://issues.apache.org/jira/browse/PIG-102?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12576227#action_12576227 ]

    Benjamin Reed commented on PIG-102:
    -----------------------------------

Yes it can. We open the file in PigInputFormat, so we can get the file from wherever we want.

I really like this proposal. My only comment would be that it might be nicer to use shared:/path for files you don't want loaded into Hadoop, rather than using a configuration file to mark the shared directories. The motivation has to do with what Craig pointed out: if you have 1000 machines accessing the shared file, you might still want to copy it to HDFS. That scenario depends on the number of machines, not the directory. For example, you may have a job doing a join against a dataset in the NFS directory /nfs/Top10MillionPhrases. If your first job uses only 20 machines to join against a rather small dataset, you would probably use shared:/nfs/Top10MillionPhrases. On the other hand, if you were joining with a 10T dataset, you would probably use file:/nfs/Top10MillionPhrases so that it gets copied to HDFS.
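The per-load trade-off described here could be sketched like this (hypothetical class and enum names, not the actual patch):

```java
import java.net.URI;

// Sketch of the per-job choice behind the shared: prefix proposal:
// shared: means "read in place off NFS", file: means "copy to HDFS
// first". Class and enum names are illustrative, not the actual patch.
public class SchemeDispatch {
    enum Action { READ_IN_PLACE, COPY_TO_HDFS, ALREADY_IN_HDFS }

    static Action actionFor(URI input) {
        String scheme = input.getScheme();
        if ("shared".equals(scheme)) {
            return Action.READ_IN_PLACE; // small data, few nodes: skip the copy
        }
        if ("file".equals(scheme)) {
            return Action.COPY_TO_HDFS;  // big data or many nodes: hadoopify
        }
        return Action.ALREADY_IN_HDFS;   // default file system
    }

    public static void main(String[] args) {
        // 20-node join against a small dataset: read straight off NFS
        System.out.println(actionFor(URI.create("shared:/nfs/Top10MillionPhrases")));
        // joining with a 10T dataset: copy into HDFS for scalability
        System.out.println(actionFor(URI.create("file:/nfs/Top10MillionPhrases")));
    }
}
```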
  • Craig Macdonald (JIRA) at Mar 7, 2008 at 3:45 pm
    [ https://issues.apache.org/jira/browse/PIG-102?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12576244#action_12576244 ]

    Craig Macdonald commented on PIG-102:
    -------------------------------------

Hi Benjamin,

    That's a simpler concept, which is easier to implement.

The disadvantage is that it is less transparent to the user. An administrator might set sharedFS globally, whereas with the shared:/ prefix idea the user has to remember shared:/ and decide for each job whether a file is local, local-but-shared, or in DFS, instead of just the first and last.

    Craig
  • Craig Macdonald (JIRA) at Mar 7, 2008 at 3:49 pm
    [ https://issues.apache.org/jira/browse/PIG-102?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

    Craig Macdonald updated PIG-102:
    --------------------------------

    Attachment: shared.patch

    Initial patch implementing Benjamin's proposed shared directive.

Benjamin, I think these are all the changes required for this to work?
  • Pi Song (JIRA) at Mar 7, 2008 at 9:49 pm
    [ https://issues.apache.org/jira/browse/PIG-102?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12576385#action_12576385 ]

    Pi Song commented on PIG-102:
    -----------------------------

Question!!!
As far as I know, HDFS has a copy-on-write implementation that helps protect against changes during execution. What about when using the local FS, or this shared NFS? Is there any mechanism to help, or are we just relying on the implementation of the file system?
  • Benjamin Reed (JIRA) at Mar 7, 2008 at 9:55 pm
    [ https://issues.apache.org/jira/browse/PIG-102?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12576390#action_12576390 ]

    Benjamin Reed commented on PIG-102:
    -----------------------------------

    Craig, with respect to your patch, you also need to make a corresponding change in PigInputFormat to know if the filename refers to something in HDFS or the local fs. I think the easiest way to do this would be to not strip off the shared: from the filename.
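One way to read this suggestion (a sketch with made-up names, not the actual PigInputFormat code): keep the prefix on the filename so the input side can pick the right file system per path:

```java
import java.net.URI;

// Sketch of keeping the shared: prefix intact so the input side can
// choose a file system per input path. Hypothetical names throughout.
public class InputFsChooser {
    /** Which file system should serve this input path? */
    static String fsFor(URI input) {
        if ("shared".equals(input.getScheme())) {
            return "local"; // path is visible on every node via NFS
        }
        return "hdfs";      // default: the input was hadoopified
    }

    public static void main(String[] args) {
        System.out.println(fsFor(URI.create("shared:/nfs/data.txt"))); // local
        System.out.println(fsFor(URI.create("/user/pig/part-00000"))); // hdfs
    }
}
```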
  • Benjamin Reed (JIRA) at Mar 7, 2008 at 9:57 pm
    [ https://issues.apache.org/jira/browse/PIG-102?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12576394#action_12576394 ]

    Benjamin Reed commented on PIG-102:
    -----------------------------------

Pi, currently HDFS is write-once. Append is coming, though, and once it arrives there will not be protection against changes during execution.
  • Pi Song (JIRA) at Mar 9, 2008 at 1:21 pm
    [ https://issues.apache.org/jira/browse/PIG-102?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12576769#action_12576769 ]

    Pi Song commented on PIG-102:
    -----------------------------

Just out of curiosity again: when Hadoop MapReduce runs on the local FS, how does it transfer data between nodes by default?
Was this capability intended for cases just like Craig's NFS?
  • Benjamin Reed (JIRA) at Mar 10, 2008 at 2:51 pm
    [ https://issues.apache.org/jira/browse/PIG-102?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12577018#action_12577018 ]

    Benjamin Reed commented on PIG-102:
    -----------------------------------

In Hadoop, if you use the local file system instead of an HDFS namenode, the local files must be shared.

Pig allows a mix. If you are running with HDFS and you want to use local files, you can specify a file: prefix and the local file will be transferred to HDFS before (or after, depending on the context) the job runs. If you throw a shared file system into the picture, you may want a mix: since the shared file system is treated as a local FS, you have the option of copying or not. Copying gives you more scalability; not copying reduces latency (sometimes).
  • Craig Macdonald (JIRA) at Mar 10, 2008 at 3:11 pm
    [ https://issues.apache.org/jira/browse/PIG-102?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12577024#action_12577024 ]

    Craig Macdonald commented on PIG-102:
    -------------------------------------

    Hi Ben,

I looked at PigInputFormat, and it essentially looked OK: the correct filesystem is identified for each path individually.

However, after this I'm a bit lost. A simple test case fails, as somewhere a local path is being used with the wrong file system. I know what the exception means, just not how to find out *where* it fails.

    {noformat}
    2008-03-10 15:01:27,698 [main] ERROR org.apache.pig.tools.grunt.Grunt - Wrong FS: file:/path/to/url.sample, expected: hdfs://node04:56228
    {noformat}

I just have no idea how to force a stack trace to be shown. Can anyone comment here (Stefan?) on how to enable traces on log.error()?

Benjamin, I was hopeful that if the proper scheme (i.e. file: after hadoopify) is left on, then the proper file system would be selected by the Hadoop layer. I suspect that HDataStorage, HFile, HDirectory etc. will have to change so that they obtain the correct filesystem for each data storage element. Generally speaking, the PIG-32 backend assumes a single file system for a single backend, an assumption that this JIRA challenges.

    Craig
  • Benjamin Reed (JIRA) at Mar 10, 2008 at 6:08 pm
    [ https://issues.apache.org/jira/browse/PIG-102?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12577100#action_12577100 ]

    Benjamin Reed commented on PIG-102:
    -----------------------------------

PigInputFormat doesn't look for any special prefixes; it assumes that all inputs are in HDFS. PigOutputFormat has the same issue. So these classes are going to need to change: do what they do now in the absence of special prefixes, and use the local file system for the shared: prefix.
  • Benjamin Reed (JIRA) at Mar 10, 2008 at 6:14 pm
    [ https://issues.apache.org/jira/browse/PIG-102?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12577102#action_12577102 ]

    Benjamin Reed commented on PIG-102:
    -----------------------------------

The Wrong FS message is coming from Hadoop, src/java/org/apache/hadoop/fs/FileSystem.java: throw new IllegalArgumentException("Wrong FS: "+path+

It's happening because you are going through HDFS instead of the local file system. A stack trace would be of great benefit :)
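For reference, the check behind that error behaves roughly like this (a minimal re-creation for illustration, not Hadoop's actual code, which also compares the authority):

```java
import java.net.URI;

// Minimal re-creation of the scheme check behind Hadoop's "Wrong FS"
// error: a FileSystem instance rejects paths whose scheme doesn't match
// its own URI. Illustrative only; the real check is more thorough.
public class WrongFsDemo {
    static void checkPath(URI fsUri, URI path) {
        String scheme = path.getScheme();
        if (scheme != null && !scheme.equals(fsUri.getScheme())) {
            throw new IllegalArgumentException(
                    "Wrong FS: " + path + ", expected: " + fsUri);
        }
    }

    public static void main(String[] args) {
        URI hdfs = URI.create("hdfs://node04:56228");
        checkPath(hdfs, URI.create("hdfs://node04:56228/data")); // passes
        try {
            // a file: path handed to the HDFS instance, as in Craig's log
            checkPath(hdfs, URI.create("file:/path/to/url.sample"));
        } catch (IllegalArgumentException e) {
            System.out.println(e.getMessage());
        }
    }
}
```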
  • Craig Macdonald (JIRA) at Mar 10, 2008 at 6:22 pm
    [ https://issues.apache.org/jira/browse/PIG-102?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12577107#action_12577107 ]

    Craig Macdonald commented on PIG-102:
    -------------------------------------

    Benjamin, yup, I know the line in Hadoop the exception comes from, just not how to find the damn stack trace without hacking Hadoop.

    My thoughts on PigInputFormat and PigOutputFormat were that if the prefixes persist until the PigInputFormat layer, then all should be OK, as Hadoop can handle both file: and shared: prefixes.

    shared: will just be a marker scheme in Grunt, which FileLocalizer transforms to the file: scheme instead of copying the file and transforming it to the hdfs: scheme. That's the vague idea anyway. I think the problems arise in HFile et al.

    C
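The marker-scheme idea Craig describes can be sketched as a simple rewrite in the localizer. This is a hypothetical illustration, not Pig's actual FileLocalizer code:

```java
// Illustrative sketch of the shared: marker scheme: the localizer rewrites
// shared: to file: rather than copying the data into HDFS. Class and
// method names here are hypothetical.
public class SchemeRewrite {

    static String localize(String location) {
        if (location.startsWith("shared:")) {
            // Same path, file: scheme -- readable directly on every node.
            return "file:" + location.substring("shared:".length());
        }
        // Anything else (file:, hdfs:, ...) passes through unchanged and
        // would follow the normal copy-to-DFS route.
        return location;
    }

    public static void main(String[] args) {
        System.out.println(localize("shared:///users/craig/data.txt"));
        // -> file:///users/craig/data.txt
    }
}
```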
  • Pi Song (JIRA) at Mar 11, 2008 at 1:36 pm
    [ https://issues.apache.org/jira/browse/PIG-102?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12577442#action_12577442 ]

    Pi Song commented on PIG-102:
    -----------------------------

    Ben, Craig

    Just a couple of small things:-
    - "Shared" doesn't give me any clue. Can we think of this copy-to-HDFS behaviour generically as "staging"? The flag keyword would then be "no-stage" (or something like that).
    - If the source file is in the local file system, should the output then be copied back to the local file system? (A bit off topic, but it gives a sense of staging.)
  • Craig Macdonald (JIRA) at Mar 11, 2008 at 2:02 pm
    [ https://issues.apache.org/jira/browse/PIG-102?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12577449#action_12577449 ]

    Craig Macdonald commented on PIG-102:
    -------------------------------------

    Pi,

    That's a fair comment. Relatedly, Torque (PBS) has stage-in and stage-out directives for jobs.

    C
  • Benjamin Reed (JIRA) at Mar 11, 2008 at 5:38 pm
    [ https://issues.apache.org/jira/browse/PIG-102?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12577526#action_12577526 ]

    Benjamin Reed commented on PIG-102:
    -----------------------------------

    There is no need for copying back. Inputs don't get changed.

    I'm not a fan of thinking of it as staging. The fact that we move local files to HDFS is an implementation detail. 'file:' indicates that the data is in the file system rather than HDFS.

    I also like shared because it indicates that it is data shared by all the machines and you want to take advantage of it. (Note, it doesn't have to be NFS. If you rsync a directory across all machines, that is going to work as well.)

  • Pi Song (JIRA) at Mar 11, 2008 at 10:30 pm
    [ https://issues.apache.org/jira/browse/PIG-102?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12577633#action_12577633 ]

    Pi Song commented on PIG-102:
    -----------------------------

    Ben,

    I think about it this way (you may disagree):-

    - The basic concept is that you've got a source file system X, an execution engine Y, and a destination file system. HDFS + MapReduce, where the source files are in HDFS, fits this model perfectly. Craig's NFS + MapReduce "no copy across" also fits.
    - Now suppose the input files are in file system X and the execution engine Y only executes on its own file system Z. Then it is the responsibility of the execution engine to pull files from the source file system into its temporary storage Z in order to execute. Afterwards the output should be copied back to the real file system (leaving the output on temporary storage Z doesn't sound good). I'm just trying to define good semantics in the first place.
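The stage-in / stage-out semantics sketched above can be illustrated with plain local copies standing in for the two file systems. All names here are illustrative; this is a toy model, not Pig code (requires Java 11+ for Files.writeString/readString):

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

// Toy model of the staging semantics: pull input from source FS X into
// the engine's own storage Z, run, then copy the output back to X.
// Temp directories stand in for the two file systems.
public class StagingDemo {

    // Stage-in: copy the input onto the engine's storage Z.
    static Path stageIn(Path source, Path scratch) throws Exception {
        Path staged = scratch.resolve(source.getFileName());
        Files.copy(source, staged, StandardCopyOption.REPLACE_EXISTING);
        return staged;
    }

    // Stage-out: copy the result back to the "real" file system.
    static void stageOut(Path result, Path dest) throws Exception {
        Files.copy(result, dest, StandardCopyOption.REPLACE_EXISTING);
    }

    public static void main(String[] args) throws Exception {
        Path scratch = Files.createTempDirectory("engine-z"); // storage Z
        Path src = Files.createTempFile("input", ".txt");     // source FS X
        Files.writeString(src, "1\n2\n3\n");

        Path staged = stageIn(src, scratch);
        // "Execution": the engine only ever touches its own copy on Z.
        String joined = Files.readString(staged).trim().replace("\n", ",");
        Path result = scratch.resolve("out.txt");
        Files.writeString(result, joined);

        Path dest = Files.createTempFile("output", ".txt");   // back on X
        stageOut(result, dest);
        System.out.println(Files.readString(dest)); // -> 1,2,3
    }
}
```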
  • Pi Song (JIRA) at Mar 19, 2008 at 2:10 pm
    [ https://issues.apache.org/jira/browse/PIG-102?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12580380#action_12580380 ]

    Pi Song commented on PIG-102:
    -----------------------------

    Craig,
    Please keep going with this issue.
    Any direction discussed in here is an improvement.

Discussion Overview
group: dev
categories: pig, hadoop
posted: Feb 11, '08 at 12:12p
active: Mar 19, '08 at 2:10p
posts: 19
users: 1
website: pig.apache.org
