FAQ
want FileSystem implementation for Amazon S3
--------------------------------------------

Key: HADOOP-574
URL: http://issues.apache.org/jira/browse/HADOOP-574
Project: Hadoop
Issue Type: New Feature
Components: fs
Reporter: Doug Cutting


An S3-based Hadoop FileSystem would make a great addition to Hadoop.

It would facillitate use of Hadoop on Amazon's EC2 computing grid, as discussed here:

http://www.mail-archive.com/[email protected]/msg00318.html

This is related to HADOOP-571, which would make Hadoop's FileSystem considerably easier to extend.



--
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira

Search Discussions

  • Doug Cutting (JIRA) at Oct 28, 2006 at 3:37 am
    [ http://issues.apache.org/jira/browse/HADOOP-574?page=comments#action_12445322 ]

    Doug Cutting commented on HADOOP-574:
    -------------------------------------

    Hadoop's wiki now includes a page describing how to use Hadoop on Amazon EC2:

    http://wiki.apache.org/lucene-hadoop/AmazonEC2

    want FileSystem implementation for Amazon S3
    --------------------------------------------

    Key: HADOOP-574
    URL: http://issues.apache.org/jira/browse/HADOOP-574
    Project: Hadoop
    Issue Type: New Feature
    Components: fs
    Reporter: Doug Cutting

    An S3-based Hadoop FileSystem would make a great addition to Hadoop.
    It would facillitate use of Hadoop on Amazon's EC2 computing grid, as discussed here:
    http://www.mail-archive.com/[email protected]/msg00318.html
    This is related to HADOOP-571, which would make Hadoop's FileSystem considerably easier to extend.
    --
    This message is automatically generated by JIRA.
    -
    If you think it was sent incorrectly contact one of the administrators: http://issues.apache.org/jira/secure/Administrators.jspa
    -
    For more information on JIRA, see: http://www.atlassian.com/software/jira
  • Tom White (JIRA) at Nov 10, 2006 at 2:11 pm
    [ http://issues.apache.org/jira/browse/HADOOP-574?page=comments#action_12448741 ]

    Tom White commented on HADOOP-574:
    ----------------------------------

    I've started work on this using a simple block design that borrows heavily from DFS. (I did a similar thing for the Java Map interface - although it was simpler - http://weblogs.java.net/blog/tomwhite/archive/2006/08/s3map.html) I should have something to show next week for discussion and review.

    Any suggestions on which package should this go in? org.hadoop.s3fs or org.hadoop.fs.s3, or something else?
    want FileSystem implementation for Amazon S3
    --------------------------------------------

    Key: HADOOP-574
    URL: http://issues.apache.org/jira/browse/HADOOP-574
    Project: Hadoop
    Issue Type: New Feature
    Components: fs
    Reporter: Doug Cutting

    An S3-based Hadoop FileSystem would make a great addition to Hadoop.
    It would facillitate use of Hadoop on Amazon's EC2 computing grid, as discussed here:
    http://www.mail-archive.com/[email protected]/msg00318.html
    This is related to HADOOP-571, which would make Hadoop's FileSystem considerably easier to extend.
    --
    This message is automatically generated by JIRA.
    -
    If you think it was sent incorrectly contact one of the administrators: http://issues.apache.org/jira/secure/Administrators.jspa
    -
    For more information on JIRA, see: http://www.atlassian.com/software/jira
  • Doug Cutting (JIRA) at Nov 10, 2006 at 5:49 pm
    [ http://issues.apache.org/jira/browse/HADOOP-574?page=comments#action_12448810 ]

    Doug Cutting commented on HADOOP-574:
    -------------------------------------
    I've started work
    Great! Jim Kellerman (jim at powerset dot com) has also privately expressed interest in implementing this. Perhaps you can collaborate?
    which package should this go in? org.hadoop.s3fs or org.hadoop.fs.s3, or something else?
    Let's go with org.apache.hadoop.fs.s3. Someday we should probably move dfs into a subpackage of fs...

    Are you going to attack HADOOP-571 too, or is anyone else working on that? Otherwise it will be hard to configure a cluster whose default filesystem is S3, or to submit a MapReduce job whose input or output use a non-default filesystem.

    want FileSystem implementation for Amazon S3
    --------------------------------------------

    Key: HADOOP-574
    URL: http://issues.apache.org/jira/browse/HADOOP-574
    Project: Hadoop
    Issue Type: New Feature
    Components: fs
    Reporter: Doug Cutting

    An S3-based Hadoop FileSystem would make a great addition to Hadoop.
    It would facillitate use of Hadoop on Amazon's EC2 computing grid, as discussed here:
    http://www.mail-archive.com/[email protected]/msg00318.html
    This is related to HADOOP-571, which would make Hadoop's FileSystem considerably easier to extend.
    --
    This message is automatically generated by JIRA.
    -
    If you think it was sent incorrectly contact one of the administrators: http://issues.apache.org/jira/secure/Administrators.jspa
    -
    For more information on JIRA, see: http://www.atlassian.com/software/jira
  • Doug Cutting (JIRA) at Nov 10, 2006 at 5:51 pm
    [ http://issues.apache.org/jira/browse/HADOOP-574?page=comments#action_12448811 ]

    Doug Cutting commented on HADOOP-574:
    -------------------------------------

    Here're some thoughts I sent Jim about this:

    DFS stores files as a sequence of ~100MB blocks. I think a scheme like this will be useful for an S3-based FileSystem too.

    When creating, each DFS block is first written locally to a temporary file, and, only when the block is full (or the file is closed) is the block actually written to DFS. This is instead of trying to trickle things to the network as they're written, which can run into timeout issues, etc. It also means that when a block write fails it can be easily retried.

    Very large files (up to a terabyte) should be supported. Breaking things into blocks should help here too. S3 limits an object value to 5GB. So each file can be represented as a set of ~100MB S3 object values. The set can be listed when the file is opened and used to guide seeks and reads of the data. The block number can be placed at the end of the name using a delimiter, so that access to metadata is not required when opening files or listing directories.

    want FileSystem implementation for Amazon S3
    --------------------------------------------

    Key: HADOOP-574
    URL: http://issues.apache.org/jira/browse/HADOOP-574
    Project: Hadoop
    Issue Type: New Feature
    Components: fs
    Reporter: Doug Cutting

    An S3-based Hadoop FileSystem would make a great addition to Hadoop.
    It would facillitate use of Hadoop on Amazon's EC2 computing grid, as discussed here:
    http://www.mail-archive.com/[email protected]/msg00318.html
    This is related to HADOOP-571, which would make Hadoop's FileSystem considerably easier to extend.
    --
    This message is automatically generated by JIRA.
    -
    If you think it was sent incorrectly contact one of the administrators: http://issues.apache.org/jira/secure/Administrators.jspa
    -
    For more information on JIRA, see: http://www.atlassian.com/software/jira
  • Tom White (JIRA) at Nov 12, 2006 at 10:12 pm
    [ http://issues.apache.org/jira/browse/HADOOP-574?page=comments#action_12449188 ]

    Tom White commented on HADOOP-574:
    ----------------------------------

    Thanks Doug. Collaboration sounds good: I'll contact Jim directly.

    Regarding HADOOP-571 I agree it makes sense to tackle this in conjunction. I'll have a look at it after we get the basics of the S3 filesystem working.

    As far as the design goes I agree that (like DFS) the S3 filesystem should divide things into blocks and buffer them to disk before writing them to S3. I'm not sure about using putting the block number at the end of the filename (using a delimiter) since this makes renames very inefficient as S3 has no rename operation. Instead I have opted for a level of indirection whereby the S3 object at the filename is a metadata file which lists the block IDs that hold the data. A rename then is simply a re-PUT of the metadata. What do you think?

    The other aspect I haven't put much thought into yet is locking. Keeping the number of HTTP requests to a minimum will be an interesting challenge.

    want FileSystem implementation for Amazon S3
    --------------------------------------------

    Key: HADOOP-574
    URL: http://issues.apache.org/jira/browse/HADOOP-574
    Project: Hadoop
    Issue Type: New Feature
    Components: fs
    Reporter: Doug Cutting

    An S3-based Hadoop FileSystem would make a great addition to Hadoop.
    It would facillitate use of Hadoop on Amazon's EC2 computing grid, as discussed here:
    http://www.mail-archive.com/[email protected]/msg00318.html
    This is related to HADOOP-571, which would make Hadoop's FileSystem considerably easier to extend.
    --
    This message is automatically generated by JIRA.
    -
    If you think it was sent incorrectly contact one of the administrators: http://issues.apache.org/jira/secure/Administrators.jspa
    -
    For more information on JIRA, see: http://www.atlassian.com/software/jira
  • Doug Cutting (JIRA) at Nov 13, 2006 at 7:23 pm
    [ http://issues.apache.org/jira/browse/HADOOP-574?page=comments#action_12449457 ]

    Doug Cutting commented on HADOOP-574:
    -------------------------------------
    The other aspect I haven't put much thought into yet is locking.
    I don't think locking is critical for for the first version. MapReduce jobs shouldn't generally require locking. Longer term we should probably consider adding a locking service to Hadoop directly, rather than relying on one in the filesystem implementation.

    If no one else steps up to the plate in the next day or so then I'll start implementing HADOOP-571 (uri-based FileSystem Paths). I'd like to get that into this month's release if possible.

    I also added a related thread to Amazon's EC2 forum.

    http://developer.amazonwebservices.com/connect/thread.jspa?threadID=12677

    When you get something working, please attach it as a patch to this issue. Thanks!

    want FileSystem implementation for Amazon S3
    --------------------------------------------

    Key: HADOOP-574
    URL: http://issues.apache.org/jira/browse/HADOOP-574
    Project: Hadoop
    Issue Type: New Feature
    Components: fs
    Reporter: Doug Cutting

    An S3-based Hadoop FileSystem would make a great addition to Hadoop.
    It would facillitate use of Hadoop on Amazon's EC2 computing grid, as discussed here:
    http://www.mail-archive.com/[email protected]/msg00318.html
    This is related to HADOOP-571, which would make Hadoop's FileSystem considerably easier to extend.
    --
    This message is automatically generated by JIRA.
    -
    If you think it was sent incorrectly contact one of the administrators: http://issues.apache.org/jira/secure/Administrators.jspa
    -
    For more information on JIRA, see: http://www.atlassian.com/software/jira
  • Tom White (JIRA) at Nov 14, 2006 at 10:14 pm
    [ http://issues.apache.org/jira/browse/HADOOP-574?page=all ]

    Tom White updated HADOOP-574:
    -----------------------------

    Status: Patch Available (was: Open)
    Affects Version/s: 0.9.0

    Please find attached a patch for an S3 filesystem for review and discussion. It is not yet complete so I am not expecting it to be committed.
    There are a number of TODO comments in the code that list the remaining work items. These mainly concern corner cases or tidying up work - all the basic file system operations work.

    The code has been exercised by unit tests. To get these to run you need to fill in S3 access codes in the test hadoop-site.xml. (This raises an interesting question of how to have these run as automated tests.)

    I have not tried running Hadoop (or something like DfsShell) with this filesystem. We probably need to wait until HADOOP-571 is done.

    Looking forward to your feedback!
    want FileSystem implementation for Amazon S3
    --------------------------------------------

    Key: HADOOP-574
    URL: http://issues.apache.org/jira/browse/HADOOP-574
    Project: Hadoop
    Issue Type: New Feature
    Components: fs
    Affects Versions: 0.9.0
    Reporter: Doug Cutting

    An S3-based Hadoop FileSystem would make a great addition to Hadoop.
    It would facillitate use of Hadoop on Amazon's EC2 computing grid, as discussed here:
    http://www.mail-archive.com/[email protected]/msg00318.html
    This is related to HADOOP-571, which would make Hadoop's FileSystem considerably easier to extend.
    --
    This message is automatically generated by JIRA.
    -
    If you think it was sent incorrectly contact one of the administrators: http://issues.apache.org/jira/secure/Administrators.jspa
    -
    For more information on JIRA, see: http://www.atlassian.com/software/jira
  • Tom White (JIRA) at Nov 14, 2006 at 10:16 pm
    [ http://issues.apache.org/jira/browse/HADOOP-574?page=all ]

    Tom White updated HADOOP-574:
    -----------------------------

    Attachment: HADOOP-574.patch
    want FileSystem implementation for Amazon S3
    --------------------------------------------

    Key: HADOOP-574
    URL: http://issues.apache.org/jira/browse/HADOOP-574
    Project: Hadoop
    Issue Type: New Feature
    Components: fs
    Affects Versions: 0.9.0
    Reporter: Doug Cutting
    Attachments: HADOOP-574.patch


    An S3-based Hadoop FileSystem would make a great addition to Hadoop.
    It would facillitate use of Hadoop on Amazon's EC2 computing grid, as discussed here:
    http://www.mail-archive.com/[email protected]/msg00318.html
    This is related to HADOOP-571, which would make Hadoop's FileSystem considerably easier to extend.
    --
    This message is automatically generated by JIRA.
    -
    If you think it was sent incorrectly contact one of the administrators: http://issues.apache.org/jira/secure/Administrators.jspa
    -
    For more information on JIRA, see: http://www.atlassian.com/software/jira
  • Tom White (JIRA) at Nov 14, 2006 at 10:28 pm
    [ http://issues.apache.org/jira/browse/HADOOP-574?page=comments#action_12449831 ]

    Tom White commented on HADOOP-574:
    ----------------------------------

    I forgot to mention that the code relies on the (Apache licensed) jets3t S3 toolkit (https://jets3t.dev.java.net/), which in turn requires commons-codec-1.3.jar and commons-httpclient-3.0.1.jar. (I wasn't sure whether I should attach these dependencies or not.)
    want FileSystem implementation for Amazon S3
    --------------------------------------------

    Key: HADOOP-574
    URL: http://issues.apache.org/jira/browse/HADOOP-574
    Project: Hadoop
    Issue Type: New Feature
    Components: fs
    Affects Versions: 0.9.0
    Reporter: Doug Cutting
    Attachments: HADOOP-574.patch


    An S3-based Hadoop FileSystem would make a great addition to Hadoop.
    It would facillitate use of Hadoop on Amazon's EC2 computing grid, as discussed here:
    http://www.mail-archive.com/[email protected]/msg00318.html
    This is related to HADOOP-571, which would make Hadoop's FileSystem considerably easier to extend.
    --
    This message is automatically generated by JIRA.
    -
    If you think it was sent incorrectly contact one of the administrators: http://issues.apache.org/jira/secure/Administrators.jspa
    -
    For more information on JIRA, see: http://www.atlassian.com/software/jira
  • Doug Cutting (JIRA) at Nov 15, 2006 at 8:03 pm
    [ http://issues.apache.org/jira/browse/HADOOP-574?page=all ]

    Doug Cutting updated HADOOP-574:
    --------------------------------

    Status: Open (was: Patch Available)

    Yes, please also attach the required jar files: that will save me having to find them!

    At a glance, the code looks great! It's unfortunate that one needs an account to run the unit tests. That means we can't require the tests in the mainline test task, since we don't want to require every developer to have an S3 account. However we can still run them as a part of the nightly task, using, e.g., my S3 account. Or maybe we could have these tests just print a warning when the account properites are undefined, but still succeed, so the default build is not broken?

    Now I need to get cracking on HADOOP-571...

    want FileSystem implementation for Amazon S3
    --------------------------------------------

    Key: HADOOP-574
    URL: http://issues.apache.org/jira/browse/HADOOP-574
    Project: Hadoop
    Issue Type: New Feature
    Components: fs
    Affects Versions: 0.9.0
    Reporter: Doug Cutting
    Attachments: HADOOP-574.patch


    An S3-based Hadoop FileSystem would make a great addition to Hadoop.
    It would facillitate use of Hadoop on Amazon's EC2 computing grid, as discussed here:
    http://www.mail-archive.com/[email protected]/msg00318.html
    This is related to HADOOP-571, which would make Hadoop's FileSystem considerably easier to extend.
    --
    This message is automatically generated by JIRA.
    -
    If you think it was sent incorrectly contact one of the administrators: http://issues.apache.org/jira/secure/Administrators.jspa
    -
    For more information on JIRA, see: http://www.atlassian.com/software/jira
  • Tom White (JIRA) at Nov 15, 2006 at 8:31 pm
    [ http://issues.apache.org/jira/browse/HADOOP-574?page=all ]

    Tom White updated HADOOP-574:
    -----------------------------

    Attachment: dependencies.zip

    Attached is a zip with the required dependencies.

    Regarding the unit testing, it is a pity that we require an account (and it costs money every time you run them!). However, I plan to write a stub implementation of FileSystemStore that can go in the mainline tests. We will still need a nightly test against the real S3 though. Or perhaps we could use ParkPlace (an S3 clone written in Ruby), although I don't know how faithful it is to Amazon...
    want FileSystem implementation for Amazon S3
    --------------------------------------------

    Key: HADOOP-574
    URL: http://issues.apache.org/jira/browse/HADOOP-574
    Project: Hadoop
    Issue Type: New Feature
    Components: fs
    Affects Versions: 0.9.0
    Reporter: Doug Cutting
    Attachments: dependencies.zip, HADOOP-574.patch


    An S3-based Hadoop FileSystem would make a great addition to Hadoop.
    It would facillitate use of Hadoop on Amazon's EC2 computing grid, as discussed here:
    http://www.mail-archive.com/[email protected]/msg00318.html
    This is related to HADOOP-571, which would make Hadoop's FileSystem considerably easier to extend.
    --
    This message is automatically generated by JIRA.
    -
    If you think it was sent incorrectly contact one of the administrators: http://issues.apache.org/jira/secure/Administrators.jspa
    -
    For more information on JIRA, see: http://www.atlassian.com/software/jira
  • Doug Cutting (JIRA) at Nov 16, 2006 at 6:33 pm
    [ http://issues.apache.org/jira/browse/HADOOP-574?page=comments#action_12450482 ]

    Doug Cutting commented on HADOOP-574:
    -------------------------------------

    That sounds like a good plan: by default unit tests should not attempt to connect to S3. But it should be easy to configure things so that S3 is used, and we can do that in the nightlies (provided my wife approves the expense!).
    want FileSystem implementation for Amazon S3
    --------------------------------------------

    Key: HADOOP-574
    URL: http://issues.apache.org/jira/browse/HADOOP-574
    Project: Hadoop
    Issue Type: New Feature
    Components: fs
    Affects Versions: 0.9.0
    Reporter: Doug Cutting
    Attachments: dependencies.zip, HADOOP-574.patch


    An S3-based Hadoop FileSystem would make a great addition to Hadoop.
    It would facillitate use of Hadoop on Amazon's EC2 computing grid, as discussed here:
    http://www.mail-archive.com/[email protected]/msg00318.html
    This is related to HADOOP-571, which would make Hadoop's FileSystem considerably easier to extend.
    --
    This message is automatically generated by JIRA.
    -
    If you think it was sent incorrectly contact one of the administrators: http://issues.apache.org/jira/secure/Administrators.jspa
    -
    For more information on JIRA, see: http://www.atlassian.com/software/jira
  • Tom White (JIRA) at Nov 21, 2006 at 8:12 am
    [ http://issues.apache.org/jira/browse/HADOOP-574?page=comments#action_12451572 ]

    Tom White commented on HADOOP-574:
    ----------------------------------

    I've been looking at Jim Kellerman's suggestion of using a URL-safe Base64 encoding for path names since they are more compact than regular URL encoding. (An S3 key is a maximum of 1024 bytes.) See http://www.faqs.org/rfcs/rfc3548.html and http://www.faqs.org/qa/rfcc-1940.html for details. Jim has implemented these algorithms in a public domain Base64 encoding package
    (http://iharder.sourceforge.net/current/java/base64/, version 2.2).

    However, I don't think Base 64 encoding is compatible with the delimiter request parameter for S3 bucket listing since base 64 encoding doesn't preserve byte boundaries, so it is not possible to search for a substring in the base 64 encoded representation of some text. The current S3 FileSystem code uses the delimiter request parameter as an efficient way to implement the listPathsRaw method. Therefore I think it is probably best to stick with the current URL encoding solution.
    want FileSystem implementation for Amazon S3
    --------------------------------------------

    Key: HADOOP-574
    URL: http://issues.apache.org/jira/browse/HADOOP-574
    Project: Hadoop
    Issue Type: New Feature
    Components: fs
    Affects Versions: 0.9.0
    Reporter: Doug Cutting
    Attachments: dependencies.zip, HADOOP-574.patch


    An S3-based Hadoop FileSystem would make a great addition to Hadoop.
    It would facillitate use of Hadoop on Amazon's EC2 computing grid, as discussed here:
    http://www.mail-archive.com/[email protected]/msg00318.html
    This is related to HADOOP-571, which would make Hadoop's FileSystem considerably easier to extend.
    --
    This message is automatically generated by JIRA.
    -
    If you think it was sent incorrectly contact one of the administrators: http://issues.apache.org/jira/secure/Administrators.jspa
    -
    For more information on JIRA, see: http://www.atlassian.com/software/jira
  • Tom White (JIRA) at Nov 26, 2006 at 9:03 pm
    [ http://issues.apache.org/jira/browse/HADOOP-574?page=all ]

    Tom White updated HADOOP-574:
    -----------------------------

    Attachment: HADOOP-574-v2.patch

    Please find attached an updated patch that includes a stub implementation of FileSystemStore for the unit tests. The nightly tests should be the only ones that run against S3 (as discussed), but I haven't changed any config to actually make this happen.

    The patch also includes updates for many of the remaining TODOs, which were mainly testing and implementing edge cases.
    want FileSystem implementation for Amazon S3
    --------------------------------------------

    Key: HADOOP-574
    URL: http://issues.apache.org/jira/browse/HADOOP-574
    Project: Hadoop
    Issue Type: New Feature
    Components: fs
    Affects Versions: 0.9.0
    Reporter: Doug Cutting
    Attachments: dependencies.zip, HADOOP-574-v2.patch, HADOOP-574.patch


    An S3-based Hadoop FileSystem would make a great addition to Hadoop.
    It would facillitate use of Hadoop on Amazon's EC2 computing grid, as discussed here:
    http://www.mail-archive.com/[email protected]/msg00318.html
    This is related to HADOOP-571, which would make Hadoop's FileSystem considerably easier to extend.
    --
    This message is automatically generated by JIRA.
    -
    If you think it was sent incorrectly contact one of the administrators: http://issues.apache.org/jira/secure/Administrators.jspa
    -
    For more information on JIRA, see: http://www.atlassian.com/software/jira
  • Tom White (JIRA) at Nov 26, 2006 at 9:33 pm
    [ http://issues.apache.org/jira/browse/HADOOP-574?page=comments#action_12453453 ]

    Tom White commented on HADOOP-574:
    ----------------------------------

    As an experiment I modified FileSystem.getNamed() to include a check for s3 (note that this change is not in HADOOP-574-v2.patch):

    if ("local".equals(name)) {
    fs = new LocalFileSystem(conf);
    } else if ("s3".equals(name)) {
    fs = new S3FileSystem(conf);
    } else {
    fs = new DistributedFileSystem(DataNode.createSocketAddr(name), conf);
    }

    I then built a hadoop distribution, unpacked, created a hadoop-site.xml wih my S3 keys and bucket name and ran the following in the bin directory:

    ./hadoop dfs -conf hadoop-site.xml -fs s3 -mkdir /tmp/tom
    ./hadoop dfs -conf hadoop-site.xml -fs s3 -copyFromLocal s3test.txt /tmp/tom/s3test.txt
    ./hadoop dfs -conf hadoop-site.xml -fs s3 -copyToLocal /tmp/tom/s3test.txt s3test.copy.txt
    diff s3test.copy.txt s3test.txt
    ./hadoop dfs -conf hadoop-site.xml -fs s3 -rm /tmp/tom/s3test.txt
    ./hadoop dfs -conf hadoop-site.xml -fs s3 -rmr /tmp/tom

    All the commands succeeded and the diff showed that the files were the same. This was a great sanity check!

    This suggests to me that DFSShell is really more general than DFS - perhaps it should be renamed FSShell?

    Next steps - I guess I need to see how the S3 implementation fits with HADOOP-571. Suggestions welcome.
    want FileSystem implementation for Amazon S3
    --------------------------------------------

    Key: HADOOP-574
    URL: http://issues.apache.org/jira/browse/HADOOP-574
    Project: Hadoop
    Issue Type: New Feature
    Components: fs
    Affects Versions: 0.9.0
    Reporter: Doug Cutting
    Attachments: dependencies.zip, HADOOP-574-v2.patch, HADOOP-574.patch


    An S3-based Hadoop FileSystem would make a great addition to Hadoop.
    It would facillitate use of Hadoop on Amazon's EC2 computing grid, as discussed here:
    http://www.mail-archive.com/[email protected]/msg00318.html
    This is related to HADOOP-571, which would make Hadoop's FileSystem considerably easier to extend.
    --
    This message is automatically generated by JIRA.
    -
    If you think it was sent incorrectly contact one of the administrators: http://issues.apache.org/jira/secure/Administrators.jspa
    -
    For more information on JIRA, see: http://www.atlassian.com/software/jira
  • Doug Cutting (JIRA) at Nov 27, 2006 at 11:27 pm
    [ http://issues.apache.org/jira/browse/HADOOP-574?page=comments#action_12453742 ]

    Doug Cutting commented on HADOOP-574:
    -------------------------------------
    DFSShell is really more general than DFS - perhaps it should be renamed FSShell?
    Yes, it probably should, since it is largely generic.
    I need to see how the S3 implementation fits with HADOOP-571. Suggestions welcome.
    What should the URI look like? The scheme should be "s3", that's easy, but what about the authority? I think it makes sense to map bucket to host, access key id to user and secret access key to password. So that would give URIs like s3://id:[email protected]/. One's hadoop-site.xml could then just specify the fs.default.name using this syntax.

    Other than that, all you should need to do is define a few new FileSystem methods (getUri(), initialize(URI, Configuration)) and a new configuration property (fs.s3.impl).
    want FileSystem implementation for Amazon S3
    --------------------------------------------

    Key: HADOOP-574
    URL: http://issues.apache.org/jira/browse/HADOOP-574
    Project: Hadoop
    Issue Type: New Feature
    Components: fs
    Affects Versions: 0.9.0
    Reporter: Doug Cutting
    Attachments: dependencies.zip, HADOOP-574-v2.patch, HADOOP-574.patch


    An S3-based Hadoop FileSystem would make a great addition to Hadoop.
    It would facillitate use of Hadoop on Amazon's EC2 computing grid, as discussed here:
    http://www.mail-archive.com/[email protected]/msg00318.html
    This is related to HADOOP-571, which would make Hadoop's FileSystem considerably easier to extend.
    --
    This message is automatically generated by JIRA.
    -
    If you think it was sent incorrectly contact one of the administrators: http://issues.apache.org/jira/secure/Administrators.jspa
    -
    For more information on JIRA, see: http://www.atlassian.com/software/jira
  • stack@archive.org (JIRA) at Nov 29, 2006 at 9:03 pm
    [ http://issues.apache.org/jira/browse/HADOOP-574?page=comments#action_12454434 ]

    [email protected] commented on HADOOP-574:
    ------------------------------------------

    +1 on scheme.

    Regards mixing in of access id and access secret, users will need to be a careful with job input files that list s3 files by URI. On bucket as host, its probably safe to say the s3 hostname, s3.amazonaws.com, isn't likely to change any time soon. And if ever there was to be an s3 filesystem that went over SSL, its scheme might be s3s?
    want FileSystem implementation for Amazon S3
    --------------------------------------------

    Key: HADOOP-574
    URL: http://issues.apache.org/jira/browse/HADOOP-574
    Project: Hadoop
    Issue Type: New Feature
    Components: fs
    Affects Versions: 0.9.0
    Reporter: Doug Cutting
    Attachments: dependencies.zip, HADOOP-574-v2.patch, HADOOP-574.patch


    An S3-based Hadoop FileSystem would make a great addition to Hadoop.
    It would facillitate use of Hadoop on Amazon's EC2 computing grid, as discussed here:
    http://www.mail-archive.com/[email protected]/msg00318.html
    This is related to HADOOP-571, which would make Hadoop's FileSystem considerably easier to extend.
    --
    This message is automatically generated by JIRA.
    -
    If you think it was sent incorrectly contact one of the administrators: http://issues.apache.org/jira/secure/Administrators.jspa
    -
    For more information on JIRA, see: http://www.atlassian.com/software/jira
  • Tom White (JIRA) at Nov 30, 2006 at 3:59 pm
    [ http://issues.apache.org/jira/browse/HADOOP-574?page=comments#action_12454681 ]

    Tom White commented on HADOOP-574:
    ----------------------------------

    I like the proposal and started to implement it, but then I discovered that my secret key contains '/' characters (two in fact), which makes it awkward to turn into a URI, since you have to manually escape them.

    While this is doable, it seems to me that it makes it harder to configure, and could be a source of interesting support questions. I haven't really got a better alternative though.


    want FileSystem implementation for Amazon S3
    --------------------------------------------

    Key: HADOOP-574
    URL: http://issues.apache.org/jira/browse/HADOOP-574
    Project: Hadoop
    Issue Type: New Feature
    Components: fs
    Affects Versions: 0.9.0
    Reporter: Doug Cutting
    Attachments: dependencies.zip, HADOOP-574-v2.patch, HADOOP-574.patch


    An S3-based Hadoop FileSystem would make a great addition to Hadoop.
    It would facillitate use of Hadoop on Amazon's EC2 computing grid, as discussed here:
    http://www.mail-archive.com/[email protected]/msg00318.html
    This is related to HADOOP-571, which would make Hadoop's FileSystem considerably easier to extend.
    --
    This message is automatically generated by JIRA.
    -
    If you think it was sent incorrectly contact one of the administrators: http://issues.apache.org/jira/secure/Administrators.jspa
    -
    For more information on JIRA, see: http://www.atlassian.com/software/jira
  • James P. White (JIRA) at Nov 30, 2006 at 5:02 pm
    [ http://issues.apache.org/jira/browse/HADOOP-574?page=comments#action_12454699 ]

    James P. White commented on HADOOP-574:
    ---------------------------------------

    What about putting the secret in a keystore, or at least some configuration file?

    That also opens up the possibility of supporting a default id, which would enable folks to set up shared/reusable jobs.
    want FileSystem implementation for Amazon S3
    --------------------------------------------

    Key: HADOOP-574
    URL: http://issues.apache.org/jira/browse/HADOOP-574
    Project: Hadoop
    Issue Type: New Feature
    Components: fs
    Affects Versions: 0.9.0
    Reporter: Doug Cutting
    Attachments: dependencies.zip, HADOOP-574-v2.patch, HADOOP-574.patch


    An S3-based Hadoop FileSystem would make a great addition to Hadoop.
    It would facillitate use of Hadoop on Amazon's EC2 computing grid, as discussed here:
    http://www.mail-archive.com/[email protected]/msg00318.html
    This is related to HADOOP-571, which would make Hadoop's FileSystem considerably easier to extend.
    --
    This message is automatically generated by JIRA.
    -
    If you think it was sent incorrectly contact one of the administrators: http://issues.apache.org/jira/secure/Administrators.jspa
    -
    For more information on JIRA, see: http://www.atlassian.com/software/jira
  • Tom White (JIRA) at Nov 30, 2006 at 9:09 pm
    [ http://issues.apache.org/jira/browse/HADOOP-574?page=comments#action_12454753 ]

    Tom White commented on HADOOP-574:
    ----------------------------------

    Thinking about this more perhaps the best thing would be to put the access key and secret access key into the configuration file (as implemented in the first two patches, the properties are fs.s3.awsAccessKeyId and fs.s3.awsSecretAccessKey), but keep the bucket name as a part of the URI: s3://bucket/.

    As an enhancement we might allow the keys to be specified in the URI (per Doug's original suggestion) if people wanted to put them there (escaped). These would take precedence over keys specified in the config file.
    want FileSystem implementation for Amazon S3
    --------------------------------------------

    Key: HADOOP-574
    URL: http://issues.apache.org/jira/browse/HADOOP-574
    Project: Hadoop
    Issue Type: New Feature
    Components: fs
    Affects Versions: 0.9.0
    Reporter: Doug Cutting
    Attachments: dependencies.zip, HADOOP-574-v2.patch, HADOOP-574.patch


    An S3-based Hadoop FileSystem would make a great addition to Hadoop.
    It would facillitate use of Hadoop on Amazon's EC2 computing grid, as discussed here:
    http://www.mail-archive.com/[email protected]/msg00318.html
    This is related to HADOOP-571, which would make Hadoop's FileSystem considerably easier to extend.
    --
    This message is automatically generated by JIRA.
    -
    If you think it was sent incorrectly contact one of the administrators: http://issues.apache.org/jira/secure/Administrators.jspa
    -
    For more information on JIRA, see: http://www.atlassian.com/software/jira
  • James P. White (JIRA) at Nov 30, 2006 at 10:15 pm
    [ http://issues.apache.org/jira/browse/HADOOP-574?page=comments#action_12454765 ]

    James P. White commented on HADOOP-574:
    ---------------------------------------

    I was thinking that you would retain the parsing of the id and secret from the URI, but that the secret would be found by id in the keystore/configuration file if omitted.

    The same approach would apply to the id if omitted, as the configuration file could supply a default id (fs.s3.awsAccessKeyId), which would then lead to the secret.

    One way to do that would be to use the configuration file and put the id in the property name:

    fs.s3.awsSecretAccessKey.<myid>
    fs.s3.awsSecretAccessKey.<myfriendsid>

    That ultimately would be compatible with the old schme if fs.s3.awsSecretAccessKey were the default secret if not otherwise specified (effectively fs.s3.awsSecretAccessKey.*).

    The only reservation I have with putting the secret in the configuration file is I don't like using plain text for secrets. An encrypted store would be nicer. Supporting OpenSSH Agents would be very nice too.

    want FileSystem implementation for Amazon S3
    --------------------------------------------

    Key: HADOOP-574
    URL: http://issues.apache.org/jira/browse/HADOOP-574
    Project: Hadoop
    Issue Type: New Feature
    Components: fs
    Affects Versions: 0.9.0
    Reporter: Doug Cutting
    Attachments: dependencies.zip, HADOOP-574-v2.patch, HADOOP-574.patch


    An S3-based Hadoop FileSystem would make a great addition to Hadoop.
    It would facillitate use of Hadoop on Amazon's EC2 computing grid, as discussed here:
    http://www.mail-archive.com/[email protected]/msg00318.html
    This is related to HADOOP-571, which would make Hadoop's FileSystem considerably easier to extend.
    --
    This message is automatically generated by JIRA.
    -
    If you think it was sent incorrectly contact one of the administrators: http://issues.apache.org/jira/secure/Administrators.jspa
    -
    For more information on JIRA, see: http://www.atlassian.com/software/jira
  • Doug Cutting (JIRA) at Dec 1, 2006 at 10:42 pm
    [ http://issues.apache.org/jira/browse/HADOOP-574?page=all ]

    Doug Cutting updated HADOOP-574:
    --------------------------------

    Affects Version/s: 0.10.0
    (was: 0.9.0)
    want FileSystem implementation for Amazon S3
    --------------------------------------------

    Key: HADOOP-574
    URL: http://issues.apache.org/jira/browse/HADOOP-574
    Project: Hadoop
    Issue Type: New Feature
    Components: fs
    Affects Versions: 0.9.0
    Reporter: Doug Cutting
    Fix For: 0.10.0

    Attachments: dependencies.zip, HADOOP-574-v2.patch, HADOOP-574.patch


    An S3-based Hadoop FileSystem would make a great addition to Hadoop.
    It would facillitate use of Hadoop on Amazon's EC2 computing grid, as discussed here:
    http://www.mail-archive.com/[email protected]/msg00318.html
    This is related to HADOOP-571, which would make Hadoop's FileSystem considerably easier to extend.
    --
    This message is automatically generated by JIRA.
    -
    If you think it was sent incorrectly contact one of the administrators: http://issues.apache.org/jira/secure/Administrators.jspa
    -
    For more information on JIRA, see: http://www.atlassian.com/software/jira
  • Doug Cutting (JIRA) at Dec 1, 2006 at 10:42 pm
    [ http://issues.apache.org/jira/browse/HADOOP-574?page=all ]

    Doug Cutting updated HADOOP-574:
    --------------------------------

    Fix Version/s: 0.10.0
    Affects Version/s: 0.9.0
    (was: 0.10.0)
    want FileSystem implementation for Amazon S3
    --------------------------------------------

    Key: HADOOP-574
    URL: http://issues.apache.org/jira/browse/HADOOP-574
    Project: Hadoop
    Issue Type: New Feature
    Components: fs
    Affects Versions: 0.9.0
    Reporter: Doug Cutting
    Fix For: 0.10.0

    Attachments: dependencies.zip, HADOOP-574-v2.patch, HADOOP-574.patch


    An S3-based Hadoop FileSystem would make a great addition to Hadoop.
    It would facillitate use of Hadoop on Amazon's EC2 computing grid, as discussed here:
    http://www.mail-archive.com/[email protected]/msg00318.html
    This is related to HADOOP-571, which would make Hadoop's FileSystem considerably easier to extend.
    --
    This message is automatically generated by JIRA.
    -
    If you think it was sent incorrectly contact one of the administrators: http://issues.apache.org/jira/secure/Administrators.jspa
    -
    For more information on JIRA, see: http://www.atlassian.com/software/jira
  • Tom White (JIRA) at Dec 12, 2006 at 9:38 pm
    [ http://issues.apache.org/jira/browse/HADOOP-574?page=comments#action_12457889 ]

    Tom White commented on HADOOP-574:
    ----------------------------------

    I've now got a version of this that works with HADOOP-571 patch 5. You can specify access key id and secret access key in the URI (Doug's suggestion), or you can specify them in hadoop-site.xml. I haven't implemented support for multiple keys or encrypted keys (James' suggestion), but don't see why this couldn't be added in the future.

    I'll create a patch after HADOOP-571 is committed.
    want FileSystem implementation for Amazon S3
    --------------------------------------------

    Key: HADOOP-574
    URL: http://issues.apache.org/jira/browse/HADOOP-574
    Project: Hadoop
    Issue Type: New Feature
    Components: fs
    Affects Versions: 0.9.0
    Reporter: Doug Cutting
    Fix For: 0.10.0

    Attachments: dependencies.zip, HADOOP-574-v2.patch, HADOOP-574.patch


    An S3-based Hadoop FileSystem would make a great addition to Hadoop.
    It would facillitate use of Hadoop on Amazon's EC2 computing grid, as discussed here:
    http://www.mail-archive.com/[email protected]/msg00318.html
    This is related to HADOOP-571, which would make Hadoop's FileSystem considerably easier to extend.
    --
    This message is automatically generated by JIRA.
    -
    If you think it was sent incorrectly contact one of the administrators: http://issues.apache.org/jira/secure/Administrators.jspa
    -
    For more information on JIRA, see: http://www.atlassian.com/software/jira
  • Tom White (JIRA) at Dec 13, 2006 at 9:47 pm
    [ http://issues.apache.org/jira/browse/HADOOP-574?page=all ]

    Tom White updated HADOOP-574:
    -----------------------------

    Attachment: HADOOP-574-v3.patch

    Here is a version that works with the changes made in HADOOP-571 that are now in trunk.

    The unit test TestJets3tS3FileSystem runs against S3 and requires an S3 account - it should be excluded from the mainline tests and run nightly perhaps (I have not made the changes to the build file to do this). TestInMemoryS3FileSystem should stay in the mainline tests.

    There is some further work to make the implementation more robust (by implementing retries), but I believe the code is ready to be committed.
    want FileSystem implementation for Amazon S3
    --------------------------------------------

    Key: HADOOP-574
    URL: http://issues.apache.org/jira/browse/HADOOP-574
    Project: Hadoop
    Issue Type: New Feature
    Components: fs
    Affects Versions: 0.9.0
    Reporter: Doug Cutting
    Fix For: 0.10.0

    Attachments: dependencies.zip, HADOOP-574-v2.patch, HADOOP-574-v3.patch, HADOOP-574.patch


    An S3-based Hadoop FileSystem would make a great addition to Hadoop.
    It would facillitate use of Hadoop on Amazon's EC2 computing grid, as discussed here:
    http://www.mail-archive.com/[email protected]/msg00318.html
    This is related to HADOOP-571, which would make Hadoop's FileSystem considerably easier to extend.
    --
    This message is automatically generated by JIRA.
    -
    If you think it was sent incorrectly contact one of the administrators: http://issues.apache.org/jira/secure/Administrators.jspa
    -
    For more information on JIRA, see: http://www.atlassian.com/software/jira
  • Doug Cutting (JIRA) at Dec 13, 2006 at 11:15 pm
    [ http://issues.apache.org/jira/browse/HADOOP-574?page=all ]

    Doug Cutting resolved HADOOP-574.
    ---------------------------------

    Resolution: Fixed

    I just committed this. Thanks, Tom, it looks great!
    want FileSystem implementation for Amazon S3
    --------------------------------------------

    Key: HADOOP-574
    URL: http://issues.apache.org/jira/browse/HADOOP-574
    Project: Hadoop
    Issue Type: New Feature
    Components: fs
    Affects Versions: 0.9.0
    Reporter: Doug Cutting
    Fix For: 0.10.0

    Attachments: dependencies.zip, HADOOP-574-v2.patch, HADOOP-574-v3.patch, HADOOP-574.patch


    An S3-based Hadoop FileSystem would make a great addition to Hadoop.
    It would facillitate use of Hadoop on Amazon's EC2 computing grid, as discussed here:
    http://www.mail-archive.com/[email protected]/msg00318.html
    This is related to HADOOP-571, which would make Hadoop's FileSystem considerably easier to extend.
    --
    This message is automatically generated by JIRA.
    -
    If you think it was sent incorrectly contact one of the administrators: http://issues.apache.org/jira/secure/Administrators.jspa
    -
    For more information on JIRA, see: http://www.atlassian.com/software/jira

Related Discussions

Discussion Navigation
viewthread | post
Discussion Overview
groupcommon-dev @
categorieshadoop
postedOct 4, '06 at 4:45p
activeDec 13, '06 at 11:15p
posts27
users1
websitehadoop.apache.org...
irc#hadoop

1 user in discussion

Doug Cutting (JIRA): 27 posts

People

Translate

site design / logo © 2023 Grokbase