FAQ
Hi,

I need to copy data from S3 to HDFS. This instruction

bin/hadoop distcp s3://<ID>:<SECRET>@<BUCKET>/path/to/logs logs

does not seem to work.

Thank you.


  • Tom White at Nov 25, 2009 at 5:20 am
    Mark,

    If the data was transferred to S3 outside of Hadoop then you should
    use the s3n filesystem scheme (see the explanation on
    http://wiki.apache.org/hadoop/AmazonS3 for the differences between the
    Hadoop S3 filesystems).

    Also, some people have had problems embedding the secret key in the
    URI, so you can set it in the configuration as follows:

    <property>
      <name>fs.s3n.awsAccessKeyId</name>
      <value>ID</value>
    </property>

    <property>
      <name>fs.s3n.awsSecretAccessKey</name>
      <value>SECRET</value>
    </property>

    Then use a URI of the form s3n://<BUCKET>/path/to/logs
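
    For example, once the keys are in the configuration file (typically
    conf/core-site.xml - adjust for your release and setup), a minimal
    sketch of the copy is:

    bin/hadoop distcp s3n://<BUCKET>/path/to/logs logs

    The relative destination "logs" should resolve against your HDFS home
    directory, as in your original command.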

    Cheers,
    Tom
  • Mark Kerzner at Nov 25, 2009 at 5:27 am
    Yes, Tom, I saw all these problems. I think I should stop trying to
    imitate EMR - that's where the idea of storing data on S3 came from - and
    instead transfer the data directly to the Hadoop cluster. Then I will be
    using everything as intended.

    Is there a way to scp directly to HDFS, or do I need to scp to local
    storage on some machine and then copy to HDFS? Also, is there a way to
    make the master a bigger instance than the slaves?

    Thank you,
    Mark
  • Tom White at Nov 25, 2009 at 6:25 pm

    On Tue, Nov 24, 2009 at 9:27 PM, Mark Kerzner wrote:
    Yes, Tom, I saw all these problems. I think I should stop trying to
    imitate EMR - that's where the idea of storing data on S3 came from - and
    instead transfer the data directly to the Hadoop cluster. Then I will be
    using everything as intended.

    Is there a way to scp directly to HDFS, or do I need to scp to local
    storage on some machine and then copy to HDFS?

    distcp is the appropriate tool for this. There is some guidance on
    http://wiki.apache.org/hadoop/AmazonS3.
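
    (If you do stage data on a node's local disk first, the second hop is a
    plain "hadoop fs -put" - a minimal sketch, with placeholder host, file,
    and HDFS path names:

    scp logs.tar.gz hadoop@<MASTER-HOST>:/tmp/
    bin/hadoop fs -put /tmp/logs.tar.gz /user/mark/logs

    distcp avoids that intermediate copy and runs the transfer as a
    parallel MapReduce job across the cluster.)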

    Also, is there a way to make the master a bigger instance than the
    slaves?

    No, this is not supported, but I can see it would be useful,
    particularly for larger clusters. Please consider opening a JIRA for
    it.

    Cheers,
    Tom

Discussion Overview
group: common-user@hadoop.apache.org
categories: hadoop
posted: Nov 25, '09 at 1:48a
active: Nov 25, '09 at 6:25p
posts: 4
users: 2
website: hadoop.apache.org...
irc: #hadoop

2 users in discussion: Tom White (2 posts), Mark Kerzner (2 posts)
