One thing I did to speed up copy/put times was to write a simple
map-reduce job that copies files in parallel from an input directory (in
our case the input directory is NFS-mounted on all task nodes). It
gave us a huge speed-up.
It's trivial to roll your own - but I'd be happy to share as well.
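A minimal stand-in for that kind of parallel copy, for anyone rolling their own: the paths here are hypothetical, and plain cp is used in place of an HDFS put so the sketch is self-contained. xargs -P runs several copy processes at once, one per file.

```shell
# Hypothetical demo paths; in a real cluster you'd point at the NFS-mounted
# input directory and replace cp with "hadoop fs -copyFromLocal".
mkdir -p /tmp/demo_src /tmp/demo_dst
for i in 1 2 3 4; do echo "data $i" > "/tmp/demo_src/part$i.txt"; done

# Copy all files, 4 processes in parallel (-P 4), one file per process.
ls /tmp/demo_src | xargs -P 4 -I{} cp "/tmp/demo_src/{}" "/tmp/demo_dst/{}"
ls /tmp/demo_dst | wc -l
```

This gets you parallelism without writing a map-reduce job at all, though the map-reduce approach has the advantage of running the copies on the task nodes themselves.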
From: C G
Sent: Friday, August 31, 2007 11:21 AM
Subject: RE: Compression using Hadoop...
My input is typical row-based stuff, across which I run a large stack
of aggregations/rollups. After reading earlier posts on this thread, I
modified my loader to split the input up into 1M-row partitions
(literally gunzip -cd input.gz | split...). I then ran an experiment
using 50M rows (i.e., 50 gz files loaded into HDFS) on an 8-node cluster.
Ted, from what you are saying I should be using at least 80 files given
the cluster size, and I should modify the loader to be aware of the
number of nodes and split accordingly. Do you concur?
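For reference, the loader's split step looks something like this. The real loader used 1M-row partitions (split -l 1000000); the demo below uses 3 rows per partition and generated data so it runs anywhere.

```shell
# Demo of "gunzip -cd input.gz | split ..." with tiny numbers.
mkdir -p /tmp/split_demo
cd /tmp/split_demo
seq 1 10 | gzip > input.gz          # 10-row stand-in for the real input

# Decompress and split into 3-row partitions named part_aa, part_ab, ...
gunzip -cd input.gz | split -l 3 - part_
ls part_* | wc -l                   # 4 partitions: 3+3+3+1 rows
```

Each partition can then be re-gzipped and loaded as its own file, which is what makes the map phase parallelizable despite gzip not being splittable.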
Load time to HDFS may be the next challenge. My HDFS configuration on
8 nodes uses a replication factor of 3. Sequentially copying my data to
HDFS using -copyFromLocal took 23 minutes to move 266M in individual
files of 5.7M each. Does anybody find this result surprising? Note
that this is on EC2, where there is no such thing as rack-level or
switch-level locality. Should I expect dramatically better performance
on real iron? Once I get this prototyping/education under my belt, my
plan is to deploy a 64-node grid of 4-way machines with a terabyte of
local storage on each node.
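A back-of-envelope on those numbers, in case it helps interpret them: with a replication factor of 3, each byte copied in is written roughly three times across the cluster.

```shell
# 266 MB in 23 minutes at replication 3 means ~3x266 = 798 MB actually
# written cluster-wide, i.e. about 34 MB/min (~0.6 MB/s) aggregate.
mb=266; minutes=23; repl=3
written=$((mb * repl))
echo "${written} MB written, ~$((written / minutes)) MB/min cluster-wide"
```

That aggregate rate is low even allowing for replication, which suggests the sequential -copyFromLocal loop (one file at a time) rather than the disks or network is the bottleneck.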
Thanks for the discussion...the Hadoop community is very helpful!
Ted Dunning wrote:
They will only be a non-issue if you have enough of them to get the
parallelism you want. If the number of gzip files is greater than 10x the
number of task nodes, you should be fine.
From: email@example.com on behalf of jason gessner
Sent: Fri 8/31/2007 9:38 AM
Subject: Re: Compression using Hadoop...
ted, will the gzip files be a non-issue as far as splitting goes if
they are under the default block size?
C G, glad i could help a little.
On 8/31/07, C G wrote:
Thanks Ted and Jason for your comments. Ted, your comments about gzip
not being splittable were very timely...I'm watching my 8-node cluster
saturate one node (with one gz file) and was wondering why. Thanks for
the "answer in advance" :-).
Ted Dunning wrote:
With gzipped files, you do face the problem that your parallelism in the map
phase is pretty much limited to the number of files you have (because
gzip'ed files aren't splittable). This is often not a problem, since most
people can arrange to have dozens to hundreds of input files more easily
than they can arrange to have dozens to hundreds of CPU cores working on
their data.