Validating input paths and creating splits is slow on S3
--------------------------------------------------------

Key: HADOOP-3095
URL: https://issues.apache.org/jira/browse/HADOOP-3095
Project: Hadoop Core
Issue Type: Improvement
Components: fs, fs/s3
Reporter: Tom White


A call to listPaths on S3FileSystem results in an S3 access for each file in the directory being queried. If the input contains hundreds or thousands of files this is prohibitively slow. This method is called in FileInputFormat.validateInput and FileInputFormat.getSplits. This would be easy to fix by overriding listPaths (all four variants) in S3FileSystem to not use listStatus, which creates a FileStatus object for each subpath. However, listPaths is deprecated in favour of listStatus, so this might be OK as a short-term measure, but not longer term.

But it gets worse: FileInputFormat.getSplits goes on to access S3 a further six times for each input file via these calls:

1. fs.isDirectory
2. fs.exists
3. fs.getLength
4. fs.getLength
5. fs.exists (from fs.getFileBlockLocations)
6. fs.getBlockSize

So it would be best to change getSplits to use listStatus, and only access S3 once for each file. (This would help HDFS too.) This change would require some care, since FileInputFormat has a protected method listPaths which subclasses can override (although, in passing, I notice validateInput doesn't use listPaths - is this a bug?).

For input validation, one approach would be to disable it for S3 by creating a custom FileInputFormat. In this case, missing files would be detected during split generation. Alternatively, it may be possible to cache the input paths between validateInput and getSplits.
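
For illustration, a minimal sketch of the listStatus-based shape proposed above; fs, job, computeSplitSize, goalSize and minSize stand in for FileInputFormat's actual fields and helpers, so treat this as an assumption rather than the eventual implementation:

{code}
// One listing call per directory; each file is then split using the fields
// its FileStatus already carries, so no further per-file S3 accesses occur.
FileStatus[] files = fs.listStatus(inputDir);
List<FileSplit> splits = new ArrayList<FileSplit>();
for (FileStatus file : files) {
  if (file.isDir()) {
    continue;                              // directories handled separately
  }
  long length = file.getLen();             // replaces fs.getLength(path)
  long blockSize = file.getBlockSize();    // replaces fs.getBlockSize(path)
  long splitSize = computeSplitSize(goalSize, minSize, blockSize);
  for (long offset = 0; offset < length; offset += splitSize) {
    splits.add(new FileSplit(file.getPath(), offset,
                             Math.min(splitSize, length - offset), job));
  }
}
{code}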


Discussion

  • Doug Cutting (JIRA) at Mar 26, 2008 at 6:19 pm
    [ https://issues.apache.org/jira/browse/HADOOP-3095?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12582386#action_12582386 ]

    Doug Cutting commented on HADOOP-3095:
    --------------------------------------

    We should fix this generally, not just for S3, since, once we remove the HDFS status cache, it will be similarly slow. A good approach would be to rework FileInputFormat to retain FileStatus instances. FileInputFormat#listPaths() should be deprecated in favor of a listStatus() method.

    As for validateInput, this can be deprecated too, or at least gutted. Its utility dates from the days when getSplits() was called on the JobTracker, and it was nice to get such error messages in the client before submitting the job. But now that getSplits() is called in the client, any needed checks can be done there (i.e., it should throw exceptions for non-existent input paths). So my vote would be to eliminate validateInput() in the long term and, in the short term, deprecate it and gut its implementation in FileInputFormat.
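
    A sketch of what that short-term gutting could look like, assuming the existing signature is kept (illustrative, not committed code):

    {code}
    /** @deprecated Input checks now happen in getSplits() on the client. */
    @Deprecated
    public void validateInput(JobConf job) throws IOException {
      // Intentionally a no-op: getSplits() throws for non-existent input
      // paths, so the client still sees an error before the job runs.
    }
    {code}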

  • Owen O'Malley (JIRA) at Mar 31, 2008 at 10:00 pm
    [ https://issues.apache.org/jira/browse/HADOOP-3095?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12583897#action_12583897 ]

    Owen O'Malley commented on HADOOP-3095:
    ---------------------------------------

    This changed in trunk. Look at HADOOP-2027. It should use getFileBlockLocations for each path now.
  • Doug Cutting (JIRA) at Mar 31, 2008 at 10:54 pm
    [ https://issues.apache.org/jira/browse/HADOOP-3095?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12583917#action_12583917 ]

    Doug Cutting commented on HADOOP-3095:
    --------------------------------------

    FileInputFormat#getSplits() in trunk still calls FileSystem#exists(), FileSystem#isDirectory(), FileSystem#getLength() (twice) and FileSystem#getBlockSize() for each input file. These could all be eliminated if it used listStatus() instead of listPaths(). The cost of these is negligible for HDFS currently, since these are all cached in the Path instance. But once we eliminate that cache, performance on HDFS will suffer.

    FileSystem#getBlockLocations() is also called once per file and would still need to be called, since FileStatus doesn't (yet) include that information.

    Also, do you see a reason for validateInput to continue to exist? Without the status cache it too will become very slow, and its purpose is now outdated.
  • Owen O'Malley (JIRA) at Mar 31, 2008 at 11:06 pm
    [ https://issues.apache.org/jira/browse/HADOOP-3095?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12583923#action_12583923 ]

    Owen O'Malley commented on HADOOP-3095:
    ---------------------------------------

    Sorry about that. I thought we'd cleaned up more of the input split creation in HADOOP-2027. Of course we should keep the FileStatus around instead of calling FileSystem methods over and over again.
  • Owen O'Malley (JIRA) at Apr 1, 2008 at 6:24 am
    [ https://issues.apache.org/jira/browse/HADOOP-3095?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

    Owen O'Malley updated HADOOP-3095:
    ----------------------------------

    Attachment: faster-job-init.patch

    I haven't really tested this, but it seems like the right direction. Thoughts?
  • Tom White (JIRA) at Apr 1, 2008 at 8:49 am
    [ https://issues.apache.org/jira/browse/HADOOP-3095?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12584058#action_12584058 ]

    Tom White commented on HADOOP-3095:
    -----------------------------------

    This looks good to me. A couple of points:

    1. As it stands, FileInputFormat#listPaths() is deprecated and no longer called. However, this will silently change the behaviour of applications that have overridden this method - and which are not recompiled - since their override will never be invoked. It would be nice to issue a warning at runtime, but I'm not sure how to do this, since it depends on being able to detect the existence of an overridden method. Any thoughts?

    2. There is still a FileSystem#exists() call in FileSystem#getBlockLocations(). Could we change/overload the latter to be FileSystem#getBlockLocations(FileStatus, long, long)?
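
    For concreteness, a sketch of the default implementation such an overload might carry in FileSystem; the body mirrors the existing Path-based default (one pseudo-block on localhost), but treat it as an assumption rather than the committed code:

    {code}
    public BlockLocation[] getFileBlockLocations(FileStatus file,
                                                 long start, long len)
        throws IOException {
      if (file == null || start < 0 || len < 0 || start > file.getLen()) {
        return null;  // same contract as the Path-based method, no exists()
      }
      // Default: one pseudo-block spanning the whole file on localhost.
      String[] name = { "localhost:50010" };
      String[] host = { "localhost" };
      return new BlockLocation[] {
          new BlockLocation(name, host, 0, file.getLen())
      };
    }
    {code}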
  • lohit vijayarenu (JIRA) at Apr 1, 2008 at 9:12 am
    [ https://issues.apache.org/jira/browse/HADOOP-3095?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12584071#action_12584071 ]

    lohit vijayarenu commented on HADOOP-3095:
    ------------------------------------------
    > 2. There is still a FileSystem#exists() call in FileSystem#getBlockLocations(). Could we change/overload the latter to be FileSystem#getBlockLocations(FileStatus, long, long)?

    HADOOP-2634 might be related. It tries to implement exists() using getFileStatus().
  • Doug Cutting (JIRA) at Apr 1, 2008 at 4:32 pm
    [ https://issues.apache.org/jira/browse/HADOOP-3095?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12584205#action_12584205 ]

    Doug Cutting commented on HADOOP-3095:
    --------------------------------------

    Shouldn't we deprecate validateInput() too?

    Also, I second Tom's concern about breaking applications that override listPaths(). If we assume that application code only overrides listPaths() but doesn't call it, then we might change the default implementation to return a special value, indicating that listPaths() has not been overridden. Then listStatus() can call this and use the returned paths if the value is not the special value, also emitting a warning that the method has been deprecated. If we wanted to get really fancy, we could return the special value only when called under listStatus(), by setting a ThreadLocal, for 100% back-compatibility. Ugly, huh?
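
    A rough illustration of that special-value scheme (NOT_OVERRIDDEN, LOG and listStatusInternal are hypothetical names, not committed code):

    {code}
    // Sentinel returned by the base class, so listStatus() can detect
    // whether a subclass actually overrode listPaths().
    private static final Path[] NOT_OVERRIDDEN = new Path[0];

    /** @deprecated Override listStatus(JobConf) instead. */
    @Deprecated
    protected Path[] listPaths(JobConf job) throws IOException {
      return NOT_OVERRIDDEN;  // reaching here means no override ran
    }

    protected FileStatus[] listStatus(JobConf job) throws IOException {
      Path[] paths = listPaths(job);
      if (paths != NOT_OVERRIDDEN) {       // identity check, not equals()
        LOG.warn("Overriding listPaths() is deprecated; override listStatus()");
        FileStatus[] result = new FileStatus[paths.length];
        for (int i = 0; i < paths.length; i++) {
          result[i] = paths[i].getFileSystem(job).getFileStatus(paths[i]);
        }
        return result;
      }
      return listStatusInternal(job);      // the normal single-listing path
    }
    {code}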

  • Owen O'Malley (JIRA) at Apr 19, 2008 at 12:09 am
    [ https://issues.apache.org/jira/browse/HADOOP-3095?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

    Owen O'Malley reassigned HADOOP-3095:
    -------------------------------------

    Assignee: Owen O'Malley
  • Tom White (JIRA) at Jun 2, 2008 at 1:36 pm
    [ https://issues.apache.org/jira/browse/HADOOP-3095?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

    Tom White updated HADOOP-3095:
    ------------------------------

    Fix Version/s: 0.18.0
  • Tom White (JIRA) at Jun 2, 2008 at 1:36 pm
    [ https://issues.apache.org/jira/browse/HADOOP-3095?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

    Tom White updated HADOOP-3095:
    ------------------------------

    Attachment: hadoop-3095.patch

    I've extended Owen's patch to

    1. Add an overloaded version of FileSystem#getBlockLocations() that takes a FileStatus object, so implementations of FileSystem have a way of avoiding an exists() call per file.
    2. Deprecate validateInput.
    3. For backward compatibility, call both listPaths and listStatus; if the paths are the same, use the FileStatus objects, otherwise fall back to the existing behaviour (use the Path objects) and issue a warning (see the sketch below). When we remove listPaths in a subsequent release we can remove this check.

    I couldn't see how to get the ThreadLocal approach suggested by Doug to work if a subclass implementation of listPaths calls its superclass (which is a common case, since the subclass typically modifies the list returned by the superclass). Calling both listPaths and listStatus does add some overhead, but with validateInput gone the patch shouldn't make things worse overall.
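
    Roughly, the compatibility check in point 3 has this shape (samePaths and toFileStatus are illustrative stand-ins for the patch's actual code):

    {code}
    Path[] fromListPaths = listPaths(job);   // may include a subclass override
    FileStatus[] fromListStatus = listStatus(job);
    if (samePaths(fromListPaths, fromListStatus)) {
      // Any override didn't change the listing, so the FileStatus objects
      // are safe to use and no per-file FileSystem calls are needed.
      return fromListStatus;
    }
    LOG.warn("listPaths() appears to be overridden; falling back to Paths");
    return toFileStatus(fromListPaths, job); // per-path getFileStatus() calls
    {code}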
  • Doug Cutting (JIRA) at Jun 2, 2008 at 6:28 pm
    [ https://issues.apache.org/jira/browse/HADOOP-3095?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12601702#action_12601702 ]

    Doug Cutting commented on HADOOP-3095:
    --------------------------------------

    Tom, this looks good to me. Perhaps we should deprecate FileSystem#getBlockLocations(Path) and update all implementations to define FileSystem#getBlockLocations(FileStatus). We'll need to make that change for HADOOP-2565 anyway, and since the primary client of this method calls only the new signature, I don't see a need for implementations to support two signatures here.
  • Tom White (JIRA) at Jun 2, 2008 at 7:46 pm
    [ https://issues.apache.org/jira/browse/HADOOP-3095?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

    Tom White updated HADOOP-3095:
    ------------------------------

    Status: Patch Available (was: Open)
  • Tom White (JIRA) at Jun 2, 2008 at 7:46 pm
    [ https://issues.apache.org/jira/browse/HADOOP-3095?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

    Tom White updated HADOOP-3095:
    ------------------------------

    Attachment: hadoop-3095-v2.patch

    Reworked patch to make implementations override FileSystem#getFileBlockLocations(FileStatus) rather than FileSystem#getFileBlockLocations(Path), which is now deprecated. The latter now just delegates to the former by calling

    {code}
    return getFileBlockLocations(getFileStatus(f), start, len);
    {code}

    KosmosFileSystem didn't override FileSystem#getFileBlockLocations(Path); it had a method named getBlockLocations instead, which would have broken data locality in MapReduce. I've fixed this and marked it with @Override.
  • Doug Cutting (JIRA) at Jun 2, 2008 at 7:58 pm
    [ https://issues.apache.org/jira/browse/HADOOP-3095?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12601745#action_12601745 ]

    Doug Cutting commented on HADOOP-3095:
    --------------------------------------

    +1 This looks fine to me.
  • Hadoop QA (JIRA) at Jun 2, 2008 at 10:30 pm
    [ https://issues.apache.org/jira/browse/HADOOP-3095?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12601776#action_12601776 ]

    Hadoop QA commented on HADOOP-3095:
    -----------------------------------

    -1 overall. Here are the results of testing the latest attachment
    http://issues.apache.org/jira/secure/attachment/12383246/hadoop-3095-v2.patch
    against trunk revision 662515.

    +1 @author. The patch does not contain any @author tags.

    -1 tests included. The patch doesn't appear to include any new or modified tests.
    Please justify why no tests are needed for this patch.

    +1 javadoc. The javadoc tool did not generate any warning messages.

    -1 javac. The applied patch generated 448 javac compiler warnings (more than the trunk's current 410 warnings).

    +1 findbugs. The patch does not introduce any new Findbugs warnings.

    +1 release audit. The applied patch does not increase the total number of release audit warnings.

    -1 core tests. The patch failed core unit tests.

    -1 contrib tests. The patch failed contrib unit tests.

    Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2539/testReport/
    Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2539/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
    Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2539/artifact/trunk/build/test/checkstyle-errors.html
    Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2539/console

    This message is automatically generated.
  • Tom White (JIRA) at Jun 3, 2008 at 12:19 pm
    [ https://issues.apache.org/jira/browse/HADOOP-3095?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

    Tom White updated HADOOP-3095:
    ------------------------------

    Attachment: hadoop-3095-v3.patch

    New patch to fix the unit test failures. Two failures were unrelated to this patch, but three exposed a problem with Path objects in FileStatus not being fully qualified. The problem arises because FileInputFormat#listStatus doesn't return fully-qualified paths, so #getSplits can't pick up the correct FileSystem(s).

    I've changed DistributedFileSystem to return a fully-qualified Path in getFileStatus(). However, this doesn't fully solve the problem as other FileSystem implementations don't return fully-qualified paths in FileStatus. Should we strengthen the contract of FileSystem to require them to do so?
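
    For example, Path#makeQualified(FileSystem) resolves a bare name against the filesystem's URI and working directory, which is what lets getSplits() recover the owning FileSystem later (illustrative snippet, with an assumed HDFS URI):

    {code}
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    Path raw = new Path("part-00000");          // scheme-less, relative
    Path qualified = raw.makeQualified(fs);
    // e.g. hdfs://namenode:8020/user/tom/part-00000
    FileStatus status = fs.getFileStatus(qualified);
    {code}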

    (The increased warning count is due to the addition of deprecated methods that are still called, mainly from tests.)
  • Owen O'Malley (JIRA) at Jun 3, 2008 at 3:51 pm
    [ https://issues.apache.org/jira/browse/HADOOP-3095?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12601982#action_12601982 ]

    Owen O'Malley commented on HADOOP-3095:
    ---------------------------------------

    I think that getFileStatus should return a fully qualified Path. I've made a start on a patch that does this for the ramfs, because that was causing me trouble.
  • Doug Cutting (JIRA) at Jun 3, 2008 at 5:43 pm
    [ https://issues.apache.org/jira/browse/HADOOP-3095?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12602003#action_12602003 ]

    Doug Cutting commented on HADOOP-3095:
    --------------------------------------

    Yes, all Path instances returned from a FileSystem method should be fully-qualified.
  • Tom White (JIRA) at Jun 3, 2008 at 8:53 pm
    [ https://issues.apache.org/jira/browse/HADOOP-3095?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

    Tom White updated HADOOP-3095:
    ------------------------------

    Attachment: hadoop-3095-v4.patch

    This patch makes all FileSystem implementations return fully-qualified Path objects in FileStatus.
  • Tom White (JIRA) at Jun 3, 2008 at 8:53 pm
    [ https://issues.apache.org/jira/browse/HADOOP-3095?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

    Tom White updated HADOOP-3095:
    ------------------------------

    Status: In Progress (was: Patch Available)
  • Tom White (JIRA) at Jun 3, 2008 at 8:55 pm
    [ https://issues.apache.org/jira/browse/HADOOP-3095?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

    Tom White updated HADOOP-3095:
    ------------------------------

    Status: Patch Available (was: In Progress)
  • Tom White (JIRA) at Jun 3, 2008 at 8:55 pm
    [ https://issues.apache.org/jira/browse/HADOOP-3095?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

    Tom White reassigned HADOOP-3095:
    ---------------------------------

    Assignee: Tom White (was: Owen O'Malley)
  • Hadoop QA (JIRA) at Jun 4, 2008 at 12:07 am
    [ https://issues.apache.org/jira/browse/HADOOP-3095?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12602123#action_12602123 ]

    Hadoop QA commented on HADOOP-3095:
    -----------------------------------

    -1 overall. Here are the results of testing the latest attachment
    http://issues.apache.org/jira/secure/attachment/12383321/hadoop-3095-v4.patch
    against trunk revision 662913.

    +1 @author. The patch does not contain any @author tags.

    -1 tests included. The patch doesn't appear to include any new or modified tests.
    Please justify why no tests are needed for this patch.

    -1 patch. The patch command could not apply the patch.

    Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2556/console

    This message is automatically generated.
  • Hadoop QA (JIRA) at Jun 4, 2008 at 8:00 am
    [ https://issues.apache.org/jira/browse/HADOOP-3095?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12602216#action_12602216 ]

    Hadoop QA commented on HADOOP-3095:
    -----------------------------------

    -1 overall. Here are the results of testing the latest attachment
    http://issues.apache.org/jira/secure/attachment/12383321/hadoop-3095-v4.patch
    against trunk revision 662976.

    +1 @author. The patch does not contain any @author tags.

    -1 tests included. The patch doesn't appear to include any new or modified tests.
    Please justify why no tests are needed for this patch.

    +1 javadoc. The javadoc tool did not generate any warning messages.

    -1 javac. The applied patch generated 447 javac compiler warnings (more than the trunk's current 409 warnings).

    +1 findbugs. The patch does not introduce any new Findbugs warnings.

    +1 release audit. The applied patch does not increase the total number of release audit warnings.

    +1 core tests. The patch passed core unit tests.

    -1 contrib tests. The patch failed contrib unit tests.

    Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2565/testReport/
    Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2565/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
    Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2565/artifact/trunk/build/test/checkstyle-errors.html
    Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2565/console

    This message is automatically generated.
  • Tom White (JIRA) at Jun 4, 2008 at 9:34 am
    [ https://issues.apache.org/jira/browse/HADOOP-3095?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12602227#action_12602227 ]

    Tom White commented on HADOOP-3095:
    -----------------------------------

    The failed contrib test passes when I run it on my local machine.
  • Owen O'Malley (JIRA) at Jun 4, 2008 at 8:33 pm
    [ https://issues.apache.org/jira/browse/HADOOP-3095?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

    Owen O'Malley updated HADOOP-3095:
    ----------------------------------

    Resolution: Fixed
    Hadoop Flags: [Reviewed]
    Status: Resolved (was: Patch Available)

    I just committed this. Thanks, Tom!
  • Tom White (JIRA) at Jun 5, 2008 at 1:50 pm
    [ https://issues.apache.org/jira/browse/HADOOP-3095?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

    Tom White updated HADOOP-3095:
    ------------------------------

    Release Note: This is an incompatible change for FileSystem implementations which override getFileBlockLocations(Path, long, long). The signature of this method should be changed to getFileBlockLocations(FileStatus, long, long).
    Hadoop Flags: [Incompatible change, Reviewed] (was: [Reviewed])
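
    For FileSystem implementers, the required change looks roughly like the sketch below. This is an illustration against the 0.18 API only: DemoFileSystem and its single-block method body are assumptions made for the example, not code from any of the attached patches. Passing a FileStatus rather than a Path is what lets split generation reuse the status objects it already obtained from listStatus instead of re-querying the filesystem.

        import java.io.IOException;
        import org.apache.hadoop.fs.BlockLocation;
        import org.apache.hadoop.fs.FileStatus;
        import org.apache.hadoop.fs.RawLocalFileSystem;

        // Hypothetical subclass (DemoFileSystem is not part of the patch).
        // Before 0.18 an implementation overrode
        //   getFileBlockLocations(Path f, long start, long len);
        // after HADOOP-3095 the override takes the FileStatus that callers
        // such as FileInputFormat.getSplits already hold, avoiding an extra
        // round trip per file.
        public class DemoFileSystem extends RawLocalFileSystem {
          @Override
          public BlockLocation[] getFileBlockLocations(FileStatus file,
                                                       long start, long len)
              throws IOException {
            if (file == null) {
              return null;  // mirrors the FileSystem contract for a null status
            }
            // The file length comes from the FileStatus itself -- no further
            // filesystem access is needed for this single-block answer.
            return new BlockLocation[] {
              new BlockLocation(new String[0], new String[0], 0, file.getLen())
            };
          }
        }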
  • Robert Chansler (JIRA) at Jun 27, 2008 at 9:08 pm
    [ https://issues.apache.org/jira/browse/HADOOP-3095?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

    Robert Chansler updated HADOOP-3095:
    ------------------------------------

    Release Note: Changed method getFileBlockLocations to getFileBlockLocations(FileStatus, long, long). (was: This is an incompatible change for FileSystem implementations which override getFileBlockLocations(Path, long, long). The signature of this method should be changed to getFileBlockLocations(FileStatus, long, long).)
    Hadoop Flags: [Incompatible change, Reviewed] (was: [Reviewed, Incompatible change])
  • Tom White (JIRA) at Jul 1, 2008 at 5:13 pm
    [ https://issues.apache.org/jira/browse/HADOOP-3095?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

    Tom White updated HADOOP-3095:
    ------------------------------

    Release Note: Added overloaded method getFileBlockLocations(FileStatus, long, long). This is an incompatible change for FileSystem implementations which override getFileBlockLocations(Path, long, long); they should change the method's signature to getFileBlockLocations(FileStatus, long, long) to work correctly. (was: Changed method getFileBlockLocations to getFileBlockLocations(FileStatus, long, long).)
    Hadoop Flags: [Incompatible change, Reviewed] (was: [Reviewed, Incompatible change])
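
    The caller-side motivation for the new overload can be sketched as follows. Again, this is an illustration only, not the actual FileInputFormat.getSplits code; SplitSketch and sketchSplits are hypothetical names. A single listStatus call per directory replaces the several-accesses-per-file pattern this issue set out to fix, and the missing-input check folds into split generation.

        import java.io.IOException;
        import org.apache.hadoop.fs.BlockLocation;
        import org.apache.hadoop.fs.FileStatus;
        import org.apache.hadoop.fs.FileSystem;
        import org.apache.hadoop.fs.Path;

        public class SplitSketch {
          // One listStatus round trip per directory; every per-file fact the
          // split calculation needs is then answered from the FileStatus.
          static void sketchSplits(FileSystem fs, Path inputDir)
              throws IOException {
            FileStatus[] files = fs.listStatus(inputDir);
            if (files == null) {
              // In this era of the API, listStatus returns null for a
              // non-existent path, so missing inputs are detected here
              // rather than in a separate validateInput pass.
              throw new IOException("Input path does not exist: " + inputDir);
            }
            for (FileStatus file : files) {
              long length = file.getLen();           // was fs.getLength(path)
              long blockSize = file.getBlockSize();  // was fs.getBlockSize(path)
              BlockLocation[] locations =
                  fs.getFileBlockLocations(file, 0, length);  // was keyed by Path
              System.out.println(file.getPath() + ": " + length + " bytes, "
                  + locations.length + " block(s), blockSize=" + blockSize);
            }
          }
        }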
