FAQ

CDH4 Hive 0.8 - Job failed 'Split metadata size exceeded 10000000'

V v
Mar 22, 2013 at 9:51 pm
I got this fixed by increasing mapreduce.jobtracker.split.metainfo.maxsize.

In Hive:
set mapreduce.jobtracker.split.metainfo.maxsize=1000000000
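
For reference, a sketch of the equivalent cluster-wide override in
mapred-site.xml (the value below simply mirrors the session override
above; note that, depending on the MapReduce version, this limit may be
read by the JobTracker itself, in which case a JobTracker restart is
needed for a mapred-site.xml change to take effect):

  <property>
    <name>mapreduce.jobtracker.split.metainfo.maxsize</name>
    <value>1000000000</value>
  </property>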

BUT

My understanding is that HDFS cuts the whole set of input files into
slices called "splits" and stores them across the nodes along with
their metadata, and that the limit on the count of this split metadata
is set by the property "mapreduce.jobtracker.split.metainfo.maxsize",
whose default value is 10 million.

I have only 5,600 files of about 25 MB each (yes, I get 5,600 mappers);
the total size is around 150 GB. (We also use a 64 MB block size.) Why
is the 10 million default being hit with this data set?

Am I missing something or doing something wrong?


----- ERROR
Job initialization failed: java.io.IOException: Split metadata size exceeded 10000000. Aborting job job_201303131859_7293
    at org.apache.hadoop.mapreduce.split.SplitMetaInfoReader.readSplitMetaInfo(SplitMetaInfoReader.java:48)
    at org.apache.hadoop.mapred.JobInProgress.createSplits(JobInProgress.java:815)
    at org.apache.hadoop.mapred.JobInProgress.initTasks(JobInProgress.java:709)
    at org.apache.hadoop.mapred.JobTracker.initJob(JobTracker.java:4043)
    at org.apache.hadoop.mapred.EagerTaskInitializationListener$InitJob.run(EagerTaskInitializationListener.java:79)
    at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
    at java.lang.Thread.run(Thread.java:662)


1 response

  • Harsh J at Mar 23, 2013 at 5:46 am
    The 10 million is not a count of entries but a byte value. If the
    input set contains a large number of files, each file's path also
    counts toward those bytes. If the total split meta info written
    exceeds the administered limit, this error is thrown.

    The limit is designed both to ward off jobs that would abuse the
    split-loading code at the JT (and potentially crash it) and to cap a
    user's maximum number of map tasks (which usually arises from the bad
    practice of storing lots of small files, a pattern that hurts
    processing in every context because of the added overhead).
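
    As a rough back-of-envelope on the bytes-vs-entries point: with 5,600
    splits, the metadata written per split would have to average about

        10,000,000 bytes / 5,600 splits ≈ 1,786 bytes per split

    before the default limit is reached, so the per-split cost (which, as
    noted above, includes the file's path) matters as much as the raw
    split count.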
    --
    Harsh J
