Thanks for the quick responses! B/t these posts (distcp, dfs -cp and dfs -put) I should be able to figure it out.
----- Original Message ----
From: Ted Dunning <email@example.com>
Sent: Wednesday, October 31, 2007 5:48:54 PM
Subject: Re: multiple file -put in dfs
This only handles the problem of putting lots of files. It doesn't help
with putting files in parallel (at once).
This is a ticklish problem since even on a relatively small cluster, the
cluster can take writes at a higher speed than most source storage can read.
That means that you can swamp things pretty easily.
When I have files on a single source machine, I just spawn multiple puts
on sub-directories until I have sufficiently saturated the read speed of the
source. If all of the cluster members have access to a universal file
system, then you can use the (undocumented) pdist command, but I don't do
that as much.
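A minimal sketch of that spawn-per-subdirectory approach, assuming a POSIX shell and a `hadoop` binary on the PATH; the paths in the usage line are hypothetical:

```shell
#!/bin/sh
# Sketch: run one "hadoop dfs -put" per sub-directory of a local source
# tree, all in the background, so several uploads proceed in parallel.
put_parallel() {
    src="$1"    # local directory whose sub-directories get uploaded
    dest="$2"   # target directory in DFS
    for d in "$src"/*/; do
        hadoop dfs -put "$d" "$dest" &   # one put per sub-directory
    done
    wait    # block until every background put has finished
}

# Hypothetical usage: put_parallel /data/incoming /myfiles
```

Add or remove background jobs (e.g. by splitting the tree differently) until the source disk, not the cluster, is the bottleneck.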
You also have to watch out if you start writing from a host in your cluster,
or else you will wind up with odd imbalances in file storage (the datanode
you write from gets the first replica of every block). In my case, the
source of the data is actually outside of the cluster and I get pretty
good balance.
If you do wind up with bad balancing, the best option I have seen is to
increase the replication on individual files for 30-60 seconds and then
decrease it again. In order to get sufficient throughput for the
rebalancing, I pipeline lots of these changes so that I have 10-100 files at
a time with higher replication. This does tend to substantially increase
the number of files with excess replication, but that corrects itself once
the replication is lowered again.
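A hedged sketch of that temporary-replication trick, using the `hadoop dfs -setrep` command; the replication factors, hold time, and file path are placeholders, not recommendations:

```shell
#!/bin/sh
# Sketch of the temporary-replication trick: raise a file's replication,
# hold it there briefly, then restore the normal factor.
bump_replication() {
    file="$1"       # DFS path of the file to rebalance
    high="$2"       # temporary, elevated replication factor
    normal="$3"     # the file's usual replication factor
    hold="${4:-45}" # seconds to hold the extra copies (30-60s range)
    hadoop dfs -setrep "$high" "$file"    # extra replicas land on idle nodes
    sleep "$hold"
    hadoop dfs -setrep "$normal" "$file"  # drop back; surplus copies are reaped
}

# Hypothetical usage: bump_replication /myfiles/part-00000 10 3
```

To pipeline the changes as described above, one would call this for batches of files rather than one at a time.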
On 10/31/07 1:53 PM, "Aaron Kimball" wrote:
hadoop dfs -put will take a directory. If it won't work recursively,
then you can probably bang out a bash script that will handle it using
find(1) and xargs(1).
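Such a script might look like the following, assuming a POSIX shell and `hadoop` on the PATH; the usage paths are hypothetical:

```shell
#!/bin/sh
# Sketch: copy every regular file under a local tree into DFS, driving
# "hadoop dfs -put" from find(1) and xargs(1) as suggested above.
put_tree() {
    src="$1"    # local directory to walk
    dest="$2"   # target directory in DFS
    # -I{} runs one put per file, with the DFS target as the last argument
    find "$src" -type f | xargs -I{} hadoop dfs -put {} "$dest"
}

# Hypothetical usage: put_tree /data/incoming /myfiles
```

Note this flattens the tree into one DFS directory; preserving the sub-directory layout would take a little more scripting.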
Chris Fellows wrote:
Quick simple question, hopefully someone out there could answer.
Does the hadoop dfs support putting multiple files at once?
The documentation says -put only works on one file. What's the best way to
import multiple files in multiple directories (i.e. dir1/file1
dir2/file1 dir2/file2 etc.)?
End goal would be to do something like:
bin/hadoop dfs -put /dir*/file* /myfiles
And a follow-up: bin/hadoop dfs -lsr /myfiles
Thanks again for any input!!!