multiple file -put in dfs
Hello!

Quick simple question, hopefully someone out there could answer.

Does the hadoop dfs support putting multiple files at once?

The documentation says -put only works on one file. What's the best way to import multiple files in multiple directories (e.g. dir1/file1 dir1/file2 dir2/file1 dir2/file2, etc.)?

End goal would be to do something like:

bin/hadoop dfs -put /dir*/file* /myfiles

And a follow-up: bin/hadoop dfs -lsr /myfiles
would list:

/myfiles/dir1/file1
/myfiles/dir1/file2
/myfiles/dir2/file1
/myfiles/dir2/file2

Thanks again for any input!!!

- chris

  • Aaron Kimball at Oct 31, 2007 at 8:55 pm
    hadoop dfs -put will take a directory. If it won't work recursively,
    then you can probably bang out a bash script that will handle it using
    find(1) and xargs(1).
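
    Untested, but a rough sketch of that approach, using the paths from
    your example (local files under /dir1, /dir2, ... and /myfiles as the
    destination), might look like:

    # recreate the directory tree in dfs, then copy each file into place
    find /dir* -type d | xargs -I{} bin/hadoop dfs -mkdir /myfiles{}
    find /dir* -type f | xargs -I{} bin/hadoop dfs -put {} /myfiles{}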

    -- Aaron

  • Ted Dunning at Oct 31, 2007 at 9:49 pm
    This only handles the problem of putting lots of files. It doesn't deal
    with putting files in parallel (at once).

    This is a ticklish problem, since even on a relatively small cluster, dfs
    can absorb writes faster than most source storage can read. That means
    you can swamp things pretty easily.

    When I have files on a single source machine, I just spawn multiple -put's
    on sub-directories until I have sufficiently saturated the read speed of the
    source. If all of the cluster members have access to a universal file
    system, then you can use the (undocumented) pdist command, but I don't like
    that as much.
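
    As a rough (untested) illustration of the first approach, with the
    source data split across /dir1, /dir2, ... on a single machine:

    # start one -put per top-level directory in the background, then wait;
    # change how many directories go per batch to tune the load on the source
    for d in /dir*/; do
      bin/hadoop dfs -put "$d" /myfiles/"$(basename "$d")" &
    done
    wait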

    You also have to watch out if you start writing from a host in your cluster,
    or else you will wind up with odd imbalances in file storage. In my case, the
    source of the data is actually outside of the cluster and I get pretty good
    balancing.

    If you do wind up with bad balancing, the best option I have seen is to
    increase the replication on individual files for 30-60 seconds and then
    decrease it again. In order to get sufficient throughput for the
    rebalancing, I pipeline lots of these changes so that I have 10-100 files at
    a time with higher replication. This does tend to substantially increase
    the number of files with excess replication, but that corrects itself pretty
    quickly.
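
    A crude, untested sketch of that trick, done per directory rather than
    pipelined per file, and assuming your version of the shell has -setrep
    and the normal replication factor is 3:

    # bump replication on one directory at a time, give the extra copies a
    # minute to land on the lightly loaded nodes, then drop back to normal
    for d in dir1 dir2; do
      bin/hadoop dfs -setrep -R 10 /myfiles/$d
      sleep 60
      bin/hadoop dfs -setrep -R 3 /myfiles/$d
    done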

  • Chris Fellows at Oct 31, 2007 at 10:01 pm
    Thanks for the quick responses! Between these posts (distcp, dfs -cp and dfs -put) I should be able to figure it out.

