Hi,

I ran a Hadoop MapReduce job in local mode, reading from and writing to
HDFS, and it took 2.5 minutes. Essentially the same operations on the local
file system without MapReduce took half a minute. Is this to be expected?

It seemed that the system spent most of the time inside the MapReduce
machinery; for example, after these messages

09/04/19 23:23:01 INFO mapred.LocalJobRunner: reduce > reduce
09/04/19 23:23:01 INFO mapred.JobClient: map 100% reduce 92%
09/04/19 23:23:04 INFO mapred.LocalJobRunner: reduce > reduce

it waited for a long time. The final output lines were

09/04/19 23:24:12 INFO mapred.LocalJobRunner: reduce > reduce
09/04/19 23:24:12 INFO mapred.TaskRunner: Task
'attempt_local_0001_r_000000_0' done.
09/04/19 23:24:12 INFO mapred.TaskRunner: Saved output of task
'attempt_local_0001_r_000000_0' to hdfs://localhost/output
09/04/19 23:24:13 INFO mapred.JobClient: Job complete: job_local_0001
09/04/19 23:24:13 INFO mapred.JobClient: Counters: 13
09/04/19 23:24:13 INFO mapred.JobClient:   File Systems
09/04/19 23:24:13 INFO mapred.JobClient:     HDFS bytes read=138103444
09/04/19 23:24:13 INFO mapred.JobClient:     HDFS bytes written=107357785
09/04/19 23:24:13 INFO mapred.JobClient:     Local bytes read=282509133
09/04/19 23:24:13 INFO mapred.JobClient:     Local bytes written=376697552
09/04/19 23:24:13 INFO mapred.JobClient:   Map-Reduce Framework
09/04/19 23:24:13 INFO mapred.JobClient:     Reduce input groups=184
09/04/19 23:24:13 INFO mapred.JobClient:     Combine output records=185
09/04/19 23:24:13 INFO mapred.JobClient:     Map input records=209
09/04/19 23:24:13 INFO mapred.JobClient:     Reduce output records=184
09/04/19 23:24:13 INFO mapred.JobClient:     Map output bytes=91863989
09/04/19 23:24:13 INFO mapred.JobClient:     Map input bytes=69051592
09/04/19 23:24:13 INFO mapred.JobClient:     Combine input records=185
09/04/19 23:24:13 INFO mapred.JobClient:     Map output records=209
09/04/19 23:24:13 INFO mapred.JobClient:     Reduce input records=184


  • Jean-Daniel Cryans at Apr 20, 2009 at 12:42 pm
    Mark,

    There is a setup cost when using Hadoop: for each task a new JVM must
    be spawned. At such a small scale, you won't see any benefit from using MR.

    J-D
  • Mark Kerzner at Apr 20, 2009 at 1:22 pm
    Jean-Daniel,
    I realize that; my question was whether this is the normal setup/teardown
    time, about 2 minutes. If it is, then fine. I would expect that on jobs
    taking 10-15 minutes, 2 minutes of overhead would be entirely justified,
    and I think that is the guideline: each task should take minutes.

    Thank you,
    Mark
  • Jean-Daniel Cryans at Apr 20, 2009 at 1:54 pm
    Mark,

    Oh sorry, yes, you should expect that kind of delay. A tip for optimizing
    big jobs with lots of tasks is to use
    JobConf.setNumTasksToExecutePerJvm(int numTasks), which sets how many
    tasks a JVM can run before being replaced (instead of spawning a new JVM
    for every task).
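
    As a minimal sketch (the driver class below is a hypothetical placeholder,
    not from this thread, and JVM reuse requires a Hadoop release that
    supports it), the call fits into ordinary old-API job setup like this:

        import org.apache.hadoop.mapred.JobConf;

        // Hypothetical driver class illustrating JVM reuse with the old
        // mapred API. -1 means "reuse each task JVM for an unlimited number
        // of tasks of the same job"; a positive value caps how many tasks a
        // JVM may run before it exits and a fresh one is spawned.
        public class JvmReuseExample {
            public static JobConf configure() {
                JobConf conf = new JobConf(JvmReuseExample.class);
                conf.setNumTasksToExecutePerJvm(-1);
                // equivalent low-level form:
                // conf.setInt("mapred.job.reuse.jvm.num.tasks", -1);
                return conf;
            }
        }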

    Happy Hadooping!

    J-D
  • Arun C Murthy at Apr 20, 2009 at 3:27 pm

    On Apr 20, 2009, at 9:56 AM, Mark Kerzner wrote:

    Hi,

    I ran a Hadoop MapReduce job in local mode, reading from and writing to
    HDFS, and it took 2.5 minutes. Essentially the same operations on the
    local file system without MapReduce took half a minute. Is this to be
    expected?
    Hmm... some overhead is expected, but this seems like too much. What
    version of Hadoop are you running?

    It's hard to help without more details about your application,
    configuration, etc., but I'll try...

    It seemed that the system spent most of the time inside the MapReduce
    machinery; for example, after these messages

    09/04/19 23:23:01 INFO mapred.LocalJobRunner: reduce > reduce
    09/04/19 23:23:01 INFO mapred.JobClient: map 100% reduce 92%
    09/04/19 23:23:04 INFO mapred.LocalJobRunner: reduce > reduce

    it waited for a long time. The final output lines were
    It could either be the reduce-side merge or the hdfs-write. Can you
    check your task-logs and data-node logs?
    09/04/19 23:24:13 INFO mapred.JobClient: Combine input records=185
    09/04/19 23:24:13 INFO mapred.JobClient: Combine output records=185
    That shows that the combiner is useless for this app (it emits exactly as
    many records as it consumes); turn it off - it adds unnecessary overhead.
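
    A hedged sketch of what "turning it off" means in driver code (the class
    names are placeholders, not from this thread): with the old mapred API
    the combiner only runs if it is registered, so disabling it is simply not
    calling setCombinerClass.

        import org.apache.hadoop.mapred.JobConf;

        public class NoCombinerExample {
            public static JobConf configure() {
                JobConf conf = new JobConf(NoCombinerExample.class);
                // Combine input records == combine output records (185 == 185)
                // in the counters above, so the combiner folded nothing and
                // only added serialization and sort overhead.
                // conf.setCombinerClass(MyReducer.class);  // intentionally omitted
                return conf;
            }
        }
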
    09/04/19 23:24:13 INFO mapred.JobClient:   File Systems
    09/04/19 23:24:13 INFO mapred.JobClient:     HDFS bytes read=138103444
    09/04/19 23:24:13 INFO mapred.JobClient:     HDFS bytes written=107357785
    09/04/19 23:24:13 INFO mapred.JobClient:     Local bytes read=282509133
    09/04/19 23:24:13 INFO mapred.JobClient:     Local bytes written=376697552
    For the amount of data you are processing, you are doing far too much
    local-disk I/O.
    'Local bytes written' should be _very_ close to 'Map output bytes', i.e.
    ~92M for the maps and zero for the reduces; here it is ~377M, roughly
    four times the map output, which points to repeated spill-and-merge
    passes. (See the counters table on the job-details web UI.)

    There are a few knobs you need to tweak to get closer to optimal
    performance; the good news is that it's doable - the bad news is that
    one _has_ to get one's fingers dirty...

    Some knobs you will be interested in are:

    Map-side:
    * io.sort.mb
    * io.sort.factor
    * io.sort.record.percent
    * io.sort.spill.percent

    Reduce-side:
    * mapred.reduce.parallel.copies
    * mapred.reduce.copy.backoff
    * mapred.job.shuffle.input.buffer.percent
    * mapred.job.shuffle.merge.percent
    * mapred.inmem.merge.threshold
    * mapred.job.reduce.input.buffer.percent


    Check the description of each of them in hadoop-default.xml or
    mapred-default.xml (depending on the version of Hadoop you are running).
    Some more details are available here:
    http://wiki.apache.org/hadoop-data/attachments/HadoopPresentations/attachments/TuningAndDebuggingMapReduce_ApacheConEU09.pdf
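
    As a hedged illustration of the mechanism only (the class name and every
    value below are placeholder assumptions, not recommendations from this
    thread), these knobs can be set programmatically on the job
    configuration:

        import org.apache.hadoop.mapred.JobConf;

        public class ShuffleTuningExample {
            public static JobConf configure() {
                JobConf conf = new JobConf(ShuffleTuningExample.class);
                // Placeholder values; tune per job and per cluster.
                conf.setInt("io.sort.mb", 200);      // map-side sort buffer, MB
                conf.setInt("io.sort.factor", 100);  // streams merged per pass
                conf.setInt("mapred.reduce.parallel.copies", 10); // shuffle fetchers
                conf.setFloat("mapred.job.shuffle.input.buffer.percent", 0.70f);
                conf.setFloat("mapred.job.reduce.input.buffer.percent", 0.50f);
                return conf;
            }
        }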

    hth,
    Arun
  • Mark Kerzner at Apr 20, 2009 at 3:37 pm
    Arun, thank you very much for the answer. I will turn off the combiner.
    I am debugging intermediate MR steps now, so I am mostly interested in
    performance for that; real tuning will come later, on a cluster. I am
    running 0.18.3, but general pointers are good enough at this stage.

    I am doing a lot of I/O because I write every binary file that I get
    from a SequenceFile back to local disk in order to process it with Tika.
    This hasn't added significant overhead, although I think Tika lets me
    hand it an input stream, which is really what I should do.
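
    A hedged sketch of that input-stream approach (names are illustrative,
    and it assumes the AutoDetectParser API of later Tika releases): wrap
    the bytes held in the SequenceFile value in an in-memory stream and hand
    it straight to Tika, skipping the local-disk round trip.

        import java.io.ByteArrayInputStream;
        import org.apache.hadoop.io.BytesWritable;
        import org.apache.tika.metadata.Metadata;
        import org.apache.tika.parser.AutoDetectParser;
        import org.apache.tika.parser.ParseContext;
        import org.apache.tika.sax.BodyContentHandler;

        public class TikaFromBytes {  // hypothetical helper
            public static String extractText(BytesWritable value) throws Exception {
                // getBytes() returns the backing array; getLength() bounds the
                // valid portion, so Tika never sees stale padding bytes.
                ByteArrayInputStream in = new ByteArrayInputStream(
                    value.getBytes(), 0, value.getLength());
                BodyContentHandler handler = new BodyContentHandler();
                new AutoDetectParser().parse(in, handler, new Metadata(),
                                             new ParseContext());
                return handler.toString();  // the extracted plain text
            }
        }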

    Oh, and yes, I already had performance problems writing to a
    SequenceFile when I was copying binary files in from the local file
    system. I hoped that writing from HDFS to HDFS would be faster - but
    maybe not. Anyway, it may all change on different hardware, and I am
    going to try it on Elastic MapReduce on Amazon.
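
    For reference, a minimal sketch of the kind of SequenceFile write being
    described (the class, path handling, and the block-compression choice
    are all assumptions for illustration, not something suggested in this
    thread):

        import org.apache.hadoop.conf.Configuration;
        import org.apache.hadoop.fs.FileSystem;
        import org.apache.hadoop.fs.Path;
        import org.apache.hadoop.io.BytesWritable;
        import org.apache.hadoop.io.SequenceFile;
        import org.apache.hadoop.io.Text;

        public class BinaryToSequenceFile {  // hypothetical helper
            public static void write(Configuration conf, Path out,
                                     String name, byte[] bytes) throws Exception {
                FileSystem fs = FileSystem.get(conf);
                // BLOCK compression batches many records per compressed block,
                // which usually cuts bytes written for binary payloads.
                SequenceFile.Writer writer = SequenceFile.createWriter(
                    fs, conf, out, Text.class, BytesWritable.class,
                    SequenceFile.CompressionType.BLOCK);
                try {
                    writer.append(new Text(name), new BytesWritable(bytes));
                } finally {
                    writer.close();
                }
            }
        }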

    And thank you for the link - I wish I were at the conference! Anyway, at
    this level I have to get my hands dirty, re-read both Hadoop books, and
    read other articles.

    Cheers,
    Mark

Discussion Overview
group: common-user
categories: hadoop
posted: Apr 20, '09 at 4:27a
active: Apr 20, '09 at 3:37p
posts: 6
users: 3
website: hadoop.apache.org...
irc: #hadoop
