Hi,

I have a 20-node hadoop cluster, processing large log files. I've seen it
said that there's never any reason to make the inputSplitSize larger than a
single HDFS block (64M), because you give up data locality for no benefit if
you do.

But when I kick off a job against the whole dataset with that default
splitSize, I get about 180,000 map tasks, most lasting about 9-15 seconds
each. Typically I can get through about half of them, then the jobTracker
freezes with OOM errors.

I do realize that I could just up the HADOOP_HEAPSIZE on the jobTracker
node. But it also seems like we ought to have fewer map tasks, lasting more
like 1 or 1.5 minutes each, to reduce the overhead to the jobTracker of
managing so many tasks, as well as the overhead to the cluster nodes of
starting and cleaning up after so many child JVMs.

Is that not a compelling reason for upping the inputSplitSize? Or am I
missing something?

Thanks
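
Below is a minimal sketch of the "larger logical split" route being asked
about, assuming the old mapred API current at the time. The class name, the
input/output paths, and the 256 MB figure are illustrative assumptions, not
from the thread; the relevant knob is mapred.min.split.size.

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.TextInputFormat;

public class BigSplitJob {
  public static void main(String[] args) throws Exception {
    JobConf conf = new JobConf(BigSplitJob.class);
    conf.setJobName("log-scan-with-big-splits");

    // FileInputFormat sizes each split as max(minSplitSize, min(goalSize, blockSize)),
    // so raising the minimum above the 64 MB block size yields fewer, larger splits
    // (each spanning several blocks, hence the locality question raised above).
    conf.setLong("mapred.min.split.size", 4L * 64 * 1024 * 1024); // ~256 MB per split

    conf.setInputFormat(TextInputFormat.class);
    FileInputFormat.setInputPaths(conf, new Path(args[0]));   // illustrative input dir
    FileOutputFormat.setOutputPath(conf, new Path(args[1]));  // illustrative output dir
    // ... mapper/reducer classes for the actual log processing go here ...

    JobClient.runJob(conf);
  }
}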


  • Harsh J at Mar 29, 2011 at 4:40 pm
    Hello Brendan W.,
    On Tue, Mar 29, 2011 at 9:01 PM, Brendan W. wrote:
    Hi,

    I have a 20-node hadoop cluster, processing large log files.  I've seen it
    said that there's never any reason to make the inputSplitSize larger than a
    single HDFS block (64M), because you give up data locality for no benefit if
    you do.
    This is true. You generally wouldn't want your InputSplits to be larger
    than the input file's block size on HDFS.
    But when I kick off a job against the whole dataset with that default
    splitSize, I get about 180,000 map tasks, most lasting about 9-15 seconds
    each.  Typically I can get through about half of them, then the jobTracker
    freezes with OOM errors.
    That your input splits are 180,000 in number is a good indicator that you have:
    a) Too many files (a small-files problem? [1])
    b) Too low a block size for the input files [2]

    [1] - http://www.cloudera.com/blog/2009/02/the-small-files-problem/
    [2] - For file sizes in the GB range, it does not make sense to have a
    64 MB block size. Increasing the block size for such files (it is a
    per-file property, after all) directly reduces your number of tasks.
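
    A minimal sketch of what "per-file property" means in practice, assuming
    the plain FileSystem API; the paths and the 256 MB figure are illustrative,
    not from the thread. A new copy of a file can be written with a larger
    block size without changing dfs.block.size cluster-wide.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IOUtils;

    public class Reblock {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        Path src = new Path(args[0]);            // existing file written with 64 MB blocks
        Path dst = new Path(args[0] + ".reblocked");

        long blockSize = 256L * 1024 * 1024;     // block size chosen for this file only
        short replication = fs.getFileStatus(src).getReplication();

        FSDataInputStream in = fs.open(src);
        // create(path, overwrite, bufferSize, replication, blockSize)
        FSDataOutputStream out = fs.create(dst, true, 64 * 1024, replication, blockSize);
        IOUtils.copyBytes(in, out, conf, true);  // copies and closes both streams
      }
    }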
  • Brendan W. at Mar 29, 2011 at 9:13 pm
    Thanks Harsh...it's definitely your point (b), i.e., giant files.

    But what would be the benefit of actually changing the DFS block size (to
    say N*64 Mbytes), as opposed to just increasing the inputSplitSize to N
    64-Mbyte blocks for my job? Both will reduce my number of mappers by a
    factor of N, right? Any benefit to one over the other?
  • Harsh J at Mar 30, 2011 at 4:50 am
    Hello again,
    On Wed, Mar 30, 2011 at 2:42 AM, Brendan W. wrote:
    But what would be the benefit of actually changing the DFS block size (to
    say N*64 Mbytes), as opposed to just increasing the inputSplitSize to N
    64-Mbyte blocks for my job?  Both will reduce my number of mappers by a
    factor of N, right?  Any benefit to one over the other?
    You'll have a data locality benefit if you choose to change the block
    size of the files themselves, instead of going for 'logical' splitting by
    the framework. A logical split that spans N blocks runs local to at most
    one of them and pulls the rest over the network, whereas a single larger
    physical block can be read entirely on the node that stores it. This
    should save you a great deal of network transfer cost in your job.
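
    A minimal sketch, assuming the standard FileSystem API, of checking the
    result: the block size a file was actually written with and which hosts
    hold each block, which is the information the scheduler uses for locality.
    The path argument is illustrative.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.BlockLocation;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class BlockReport {
      public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        FileStatus stat = fs.getFileStatus(new Path(args[0]));

        System.out.println("block size: " + stat.getBlockSize() + " bytes");

        // One BlockLocation per block; a map task scheduled on one of these
        // hosts reads that block locally, anything else crosses the network.
        for (BlockLocation loc : fs.getFileBlockLocations(stat, 0, stat.getLen())) {
          System.out.println("offset " + loc.getOffset()
              + " length " + loc.getLength()
              + " hosts " + java.util.Arrays.toString(loc.getHosts()));
        }
      }
    }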
