FileSystem should not name files with java.io.File
--------------------------------------------------

Key: HADOOP-129
URL: http://issues.apache.org/jira/browse/HADOOP-129
Project: Hadoop
Type: Improvement

Components: fs
Versions: 0.1.1, 0.1.0
Reporter: Doug Cutting
Fix For: 0.2


In Hadoop's FileSystem API, files are currently named using java.io.File. This is confusing, as many methods on that class are inappropriate to call on Hadoop paths. For example, calling isDirectory(), exists(), etc. on a java.io.File is not the same as calling FileSystem.isDirectory() or FileSystem.exists() passing that same file. Using java.io.File also makes correct operation on Windows difficult, since java.io.File operates differently on Windows in order to accommodate Windows path names. For example, new File("/foo") is not absolute on Windows, and prints its path as "\\foo", which causes confusion.
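
To see the platform dependence described above, a minimal sketch like the following can be run on both Unix and Windows. The class name FileNamingDemo and the helper describe() are illustrative only and not part of Hadoop; no expected output is shown because it depends on the operating system.

```java
import java.io.File;

class FileNamingDemo {
    // Reports how java.io.File interprets a name on the current platform.
    static String describe(String name) {
        File f = new File(name);
        return "path=" + f.getPath() + " absolute=" + f.isAbsolute();
    }

    public static void main(String[] args) {
        // Per the description above, Windows reports "/foo" as not absolute
        // and prints it with a backslash; Unix-like systems keep it as is.
        System.out.println(describe("/foo"));
        // exists() consults the local disk only, so it says nothing about
        // whether "/foo" exists in a distributed file system.
        System.out.println(new File("/foo").exists());
    }
}
```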

To fix this we could replace the uses of java.io.File in the FileSystem API with String, a new FileName class, or perhaps java.net.URI. The advantage of URI is that it can also naturally include the namenode host and port. The disadvantage is that URI does not support tree operations like getParent().
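
As a rough illustration of how a URI could carry the namenode host and port alongside the path, consider the sketch below. The "hdfs" scheme, the host name namenode.example.com, and port 9000 are placeholder assumptions, as are the class and method names.

```java
import java.net.URI;

class NamenodeUriDemo {
    // Extracts "host:port" from a file name expressed as a URI.
    static String hostAndPort(URI name) {
        return name.getHost() + ":" + name.getPort();
    }

    public static void main(String[] args) {
        // Placeholder namenode address; any host and port would do.
        URI name = URI.create("hdfs://namenode.example.com:9000/user/data/part-00000");
        System.out.println(name.getScheme());   // hdfs
        System.out.println(hostAndPort(name));  // namenode.example.com:9000
        System.out.println(name.getPath());     // /user/data/part-00000
    }
}
```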

This change will cause a lot of incompatibility. Thus it should probably be made early in a development cycle in order to maximize the time for folks to adapt to it.

--
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
http://www.atlassian.com/software/jira

  • Andrzej Bialecki (JIRA) at Apr 10, 2006 at 8:37 pm
    [ http://issues.apache.org/jira/browse/HADOOP-129?page=comments#action_12373924 ]

    Andrzej Bialecki commented on HADOOP-129:
    ------------------------------------------

    I think we should change this to a Hadoop-specific class, e.g. FileName (not a simple String - too limiting). FileName-s could only be used when holding a reference to a valid instance of FileSystem - this way operations like getParent() could always consult FileSystem-specific routines to resolve DFS names to real names in case of LocalFileSystem.

    I also propose that this class should be versioned, and contain some File-like metadata - for now I'm thinking specifically about creation / modification time.
  • Doug Cutting (JIRA) at Apr 10, 2006 at 9:00 pm
    [ http://issues.apache.org/jira/browse/HADOOP-129?page=comments#action_12373928 ]

    Doug Cutting commented on HADOOP-129:
    -------------------------------------
    > I think we should change this to a Hadoop-specific class, e.g. FileName.
    Why not URI? What required methods are missing from URI? Conversely, what URI methods do you think might cause problems?

    Partially answering my own question, with URIs we'd have to check that the scheme, host and port matched the fs when implementing each FS method. In other words, given that we need a FileSystem instance to do anything, the scheme, host and port fields of the URI are usually redundant and force us to perform error checking. However these same fields would be useful when specifying MapReduce input and output directories, in command lines, etc., permitting one to easily specify non-default FileSystem implementations.
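
    The per-method check described here might look roughly like the following sketch. FsUriCheck and checkPath() are hypothetical names, not Hadoop API, and treating a scheme-less name as belonging to the current file system is an assumption made for the example.

    ```java
    import java.net.URI;
    import java.util.Objects;

    class FsUriCheck {
        // Identifies the file system this instance serves (assumed to have a scheme).
        private final URI fsUri;

        FsUriCheck(URI fsUri) {
            this.fsUri = fsUri;
        }

        // Returns the path portion of the name if it belongs to this file system;
        // throws if the scheme, host or port point somewhere else.
        String checkPath(URI name) {
            if (name.getScheme() == null) {
                return name.getPath(); // no scheme: assume the current file system
            }
            boolean sameFs = name.getScheme().equalsIgnoreCase(fsUri.getScheme())
                    && Objects.equals(name.getHost(), fsUri.getHost())
                    && name.getPort() == fsUri.getPort();
            if (!sameFs) {
                throw new IllegalArgumentException("Wrong FS: " + name + ", expected: " + fsUri);
            }
            return name.getPath();
        }

        public static void main(String[] args) {
            // Placeholder namenode address.
            FsUriCheck fs = new FsUriCheck(URI.create("hdfs://namenode.example.com:9000/"));
            System.out.println(fs.checkPath(URI.create("/user/data")));
            System.out.println(fs.checkPath(URI.create("hdfs://namenode.example.com:9000/user/data")));
        }
    }
    ```

    Every FS method taking a URI would need to start with a call like checkPath(), which is the redundancy the comment points out.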

    Note that I don't think URI buys us interoperability with other systems. So we should only use it if we think it will make writing Hadoop easier: if it consists of code that we'd mostly need to write anyway.

    A side-benefit of URI is that it provides standards-defined filename syntax. We don't have to figure out how to, e.g., escape things, or how backslashes and colons should be treated, etc. We can simply point to a standard.
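
    As one example of leaning on the standard for escaping, the multi-argument java.net.URI constructors percent-encode characters that are not legal in a path. The sketch below is illustrative only; the class and helper names are made up for the example.

    ```java
    import java.net.URI;
    import java.net.URISyntaxException;

    class UriEscapingDemo {
        // Builds a scheme-less URI from a raw path. The four-argument constructor
        // (scheme, host, path, fragment) quotes illegal characters such as spaces.
        static URI fromPath(String path) {
            try {
                return new URI(null, null, path, null);
            } catch (URISyntaxException e) {
                throw new IllegalArgumentException(e);
            }
        }

        public static void main(String[] args) {
            URI u = fromPath("/my dir/a file.txt");
            System.out.println(u.getRawPath()); // /my%20dir/a%20file.txt
            System.out.println(u.getPath());    // /my dir/a file.txt
        }
    }
    ```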
    > I also propose that this class should be versioned, and contain some File-like metadata - for now I'm thinking specifically about creation / modification time.
    This works so long as files are write-once. But if they can be appended to or overwritten then this information could get stale.
  • Doug Cutting (JIRA) at Apr 11, 2006 at 6:23 pm
    [ http://issues.apache.org/jira/browse/HADOOP-129?page=comments#action_12374087 ]

    Doug Cutting commented on HADOOP-129:
    -------------------------------------

    URI actually *can* compute parent directory. For example:

    URI subDir = new URI("/foo/bar/baz/");
    URI parent = subDir.resolve("..");

    parent.toString() returns "/foo/bar/".
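
    A runnable variant of the snippet above, as a sketch: the class and method names are illustrative. It also shows that the trailing slash matters, since without it the last segment is treated as a file name rather than a directory.

    ```java
    import java.net.URI;

    class UriParentDemo {
        // Parent of a directory-style name, i.e. one that ends in "/".
        static URI parentOfDir(URI dir) {
            return dir.resolve("..");
        }

        public static void main(String[] args) {
            System.out.println(parentOfDir(URI.create("/foo/bar/baz/"))); // /foo/bar/
            // Without the trailing slash, "baz" is taken to be a file in /foo/bar/,
            // so ".." resolves relative to /foo/bar/ and "." yields the containing directory.
            System.out.println(URI.create("/foo/bar/baz").resolve(".."));  // /foo/
            System.out.println(URI.create("/foo/bar/baz").resolve("."));   // /foo/bar/
        }
    }
    ```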

    So I think that URI has the features we want for filenames and not much else. Am I missing something?

    It might also be useful to implement a URLStreamHandler, so that one can create "hdfs:" URLs and use them wherever Java accepts URLs, e.g., in classloaders, etc. But the URL class doesn't support relative path name resolution, the primary feature we require for names.

    Unless there are objections, I'll start exploring replacing the uses of java.io.File with java.net.URI.

    My thinking is that we remove rather than deprecate the old methods. This makes the change incompatible, but I think we really want to get rid of the use of java.io.File. I'm willing to update Nutch & unit tests as required, but this may break others' code. Should we instead deprecate these in Hadoop 0.2 and then remove them in 0.3? Thoughts?
  • Eric Baldeschwieler at Apr 11, 2006 at 9:19 pm
    I'm all for the change!
    Of course we don't have a nutch base to upgrade.
  • Doug Cutting (JIRA) at Apr 11, 2006 at 11:34 pm
    [ http://issues.apache.org/jira/browse/HADOOP-129?page=comments#action_12374115 ]

    Doug Cutting commented on HADOOP-129:
    -------------------------------------

    Working through this more, I'm now leaning away from URI and towards a new class. It will be easier to replace File with a new class, since the API can be made to resemble File. For example, we have a lot of code that calls 'new File(dir, name)' to construct a file in a subdirectory. The idiom for doing that with URIs is slightly more complicated, and would require a utility method somewhere. Similarly for file.getParentFile(), etc.

    So now I'm leaning towards a class named "Path" that's mostly a drop-in replacement for File, except it doesn't support FS operations like exists(), mkdir(), delete(), etc.
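
    To make the idea concrete, here is a minimal sketch of what such a name-only class could look like. PathSketch is a hypothetical illustration and not the class from the eventual patch; it mirrors the File(parent, child), getParent() and getName() idioms while deliberately offering no exists(), mkdir() or delete().

    ```java
    final class PathSketch {
        // Always "/"-separated, regardless of platform; no trailing slash except for root.
        private final String path;

        PathSketch(String path) {
            if (path == null || path.isEmpty()) {
                throw new IllegalArgumentException("empty path");
            }
            String p = path;
            while (p.length() > 1 && p.endsWith("/")) {
                p = p.substring(0, p.length() - 1);
            }
            this.path = p;
        }

        // Counterpart of new File(dir, name).
        PathSketch(PathSketch parent, String child) {
            this(parent.path.endsWith("/") ? parent.path + child : parent.path + "/" + child);
        }

        // Counterpart of File.getParentFile(); null for root or a single relative component.
        PathSketch getParent() {
            int i = path.lastIndexOf('/');
            if (i < 0 || path.length() == 1) {
                return null;
            }
            return new PathSketch(i == 0 ? "/" : path.substring(0, i));
        }

        String getName() {
            return path.substring(path.lastIndexOf('/') + 1);
        }

        boolean isAbsolute() {
            return path.startsWith("/");
        }

        @Override
        public String toString() {
            return path;
        }

        public static void main(String[] args) {
            PathSketch file = new PathSketch(new PathSketch("/user/doug"), "part-0");
            System.out.println(file);             // /user/doug/part-0
            System.out.println(file.getParent()); // /user/doug
            System.out.println(file.getName());   // part-0
        }
    }
    ```

    Because the class has no file system operations at all, code that tries to call exists() on a name fails to compile instead of silently consulting the local disk.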
  • eric baldeschwieler (JIRA) at Apr 16, 2006 at 5:43 am
    [ http://issues.apache.org/jira/browse/HADOOP-129?page=comments#action_12374671 ]

    eric baldeschwieler commented on HADOOP-129:
    --------------------------------------------

    It could contain a URI...
  • Igor Bolotin (JIRA) at Apr 16, 2006 at 5:50 am
    [ http://issues.apache.org/jira/browse/HADOOP-129?page=comments#action_12374672 ]

    Igor Bolotin commented on HADOOP-129:
    -------------------------------------

    Does it make sense to create a class that would extend File and override unsupported operations to throw UnsupportedOperationException?
  • Doug Cutting (JIRA) at Apr 17, 2006 at 10:35 pm
    [ http://issues.apache.org/jira/browse/HADOOP-129?page=comments#action_12374824 ]

    Doug Cutting commented on HADOOP-129:
    -------------------------------------
    > Does it make sense to create a class that would extend File and override unsupported operations to throw UnsupportedOperationException?
    I'm not sure what advantages that would have. Is the idea to detect errors at runtime rather than at compile time? I've just about finished a patch adding a new class. I'll post it later today.
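
    The trade-off raised in this exchange can be sketched as follows. GuardedFile is a hypothetical subclass invented for the example: code that misuses it still compiles, and the mistake only surfaces when the call is executed.

    ```java
    import java.io.File;

    class RuntimeCheckDemo {
        // Hypothetical File subclass that rejects local-disk operations.
        static class GuardedFile extends File {
            private static final long serialVersionUID = 1L;

            GuardedFile(String pathname) {
                super(pathname);
            }

            @Override
            public boolean exists() {
                throw new UnsupportedOperationException("use FileSystem.exists() instead");
            }

            @Override
            public boolean isDirectory() {
                throw new UnsupportedOperationException("use FileSystem.isDirectory() instead");
            }
        }

        // Compiles without complaint, because a GuardedFile is still a File.
        static boolean misuse(File f) {
            return f.exists();
        }

        public static void main(String[] args) {
            try {
                misuse(new GuardedFile("/foo"));
            } catch (UnsupportedOperationException e) {
                System.out.println("caught only at run time: " + e.getMessage());
            }
        }
    }
    ```

    With a separate class that simply lacks these methods, the same mistake would be rejected by the compiler.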
  • Doug Cutting (JIRA) at Apr 17, 2006 at 11:16 pm
    [ http://issues.apache.org/jira/browse/HADOOP-129?page=all ]

    Doug Cutting updated HADOOP-129:
    --------------------------------

    Attachment: path.patch

    Here's a patch that replaces uses of java.io.File in Hadoop's FileSystem and MapReduce APIs with a new class named Path. I left some existing File-based methods, now deprecated, sufficient for Nutch to run w/o alteration. I'd like to remove the deprecated methods after the 0.2 release.
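
    The general pattern of keeping a deprecated File-based overload that delegates to the new method might look like the sketch below. This is not taken from path.patch; the class, the nested Name type and the in-memory set of files are all stand-ins invented for the example.

    ```java
    import java.io.File;
    import java.util.HashSet;
    import java.util.Set;

    class DeprecatedOverloadSketch {
        // Hypothetical stand-in for the new name class.
        static final class Name {
            private final String value;

            Name(String value) {
                this.value = value;
            }

            @Override
            public String toString() {
                return value;
            }
        }

        // Toy in-memory "file system": just a set of known names.
        private final Set<String> files = new HashSet<>();

        void create(Name name) {
            files.add(name.toString());
        }

        // New-style method.
        boolean exists(Name name) {
            return files.contains(name.toString());
        }

        /** @deprecated use {@link #exists(Name)} instead. */
        @Deprecated
        boolean exists(File file) {
            // Convert platform separators to "/" and delegate to the new method.
            return exists(new Name(file.getPath().replace(File.separatorChar, '/')));
        }

        public static void main(String[] args) {
            DeprecatedOverloadSketch fs = new DeprecatedOverloadSketch();
            fs.create(new Name("/user/doug/data"));
            System.out.println(fs.exists(new Name("/user/doug/data")));
            System.out.println(fs.exists(new File("/user/doug/data"))); // compiles with a deprecation warning
        }
    }
    ```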

    I believe that the only incompatible change is that dfs.data.dir and mapred.local.dir, when lists of directories, must now be comma-separated and may no longer be space-separated. This is in order to make things work better on Windows.
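
    The comment does not spell out why commas work better on Windows; one plausible reason (an assumption here) is that Windows directory names often contain spaces. The sketch below parses a comma-separated list in that spirit; parseDirs() and the sample directories are illustrative and not Hadoop's actual configuration code.

    ```java
    import java.util.ArrayList;
    import java.util.List;

    class DirListDemo {
        // Splits a comma-separated directory list, trimming whitespace around entries.
        static List<String> parseDirs(String value) {
            List<String> dirs = new ArrayList<>();
            for (String entry : value.split(",")) {
                String dir = entry.trim();
                if (!dir.isEmpty()) {
                    dirs.add(dir);
                }
            }
            return dirs;
        }

        public static void main(String[] args) {
            // A space inside a directory name survives; splitting on spaces would break it.
            for (String dir : parseDirs("C:\\Program Files\\hadoop\\data, D:\\dfs\\data")) {
                System.out.println(dir);
            }
        }
    }
    ```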

    I have tested this in standalone and pseudo-distributed operation on both Linux and Windows, with unit tests and with the Nutch crawler.

    Barring objections, I will apply this tomorrow.
  • Doug Cutting (JIRA) at Apr 18, 2006 at 5:08 pm
    [ http://issues.apache.org/jira/browse/HADOOP-129?page=all ]

    Doug Cutting resolved HADOOP-129:
    ---------------------------------

    Resolution: Fixed
    Assign To: Doug Cutting

    I just committed this. It was a big change. I hope I haven't broken anything!

Discussion Overview
group: common-dev @
categories: hadoop
posted: Apr 10, '06 at 7:06p
active: Apr 18, '06 at 5:08p
posts: 11
users: 2
website: hadoop.apache.org...
irc: #hadoop