Complete newbie map/reduce question here. I am using Hadoop Streaming, as
I come from a Perl background, and am trying to prototype/test a process
to load and clean up ad server log lines from multiple input files into one
large file on HDFS that can then be used as the source of a Hive
table.

I have a Perl map script that reads an input line from stdin, does the
needed cleanup/manipulation, and writes the result back to stdout. I don't
really need a reduce step, as I don't care what order the lines are
written in, and there is no summary data to produce. When I run the job
with -reducer NONE I get valid output, but I get multiple part-xxxxx
files rather than one big file.

So I wrote a trivial 'reduce' script that reads from stdin, splits off the
key, and writes the value back to stdout.
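
In shell terms it boils down to roughly this (the real script is Perl; this
sketch just shows the behaviour, assuming streaming's usual tab-separated
key/value lines):

#!/bin/sh
# Rough stand-in for the reduce script: streaming hands the reducer
# "key<TAB>value" lines on stdin; drop the key, echo the value to stdout.
exec cut -f 2-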

I am executing the code as follows:

./hadoop jar ../contrib/streaming/hadoop-0.19.1-streaming.jar \
  -mapper "/usr/bin/perl /home/hadoop/scripts/map_parse_log_r2.pl" \
  -reducer "/usr/bin/perl /home/hadoop/scripts/reduce_parse_log.pl" \
  -input /logs/*.log -output test9

The code I have works when given a small set of input files. However, I
get the following error when attempting to run the code on a large set
of input files:

hadoop-hadoop-jobtracker-testdw0b00.log.2009-06-09:2009-06-09
15:43:00,905 WARN org.apache.hadoop.mapred.JobInProgress: No room for
reduce task. Node
tracker_testdw0b00:localhost.localdomain/127.0.0.1:53245 has 2004049920
bytes free; but we expect reduce input to take 22138478392

I assume this is because all the map output is being buffered in
memory prior to running the reduce step? If so, what can I change to
stop the buffering? I just need the map output to go directly to one
large file.

Thanks,
Scott

  • Alex Loddengaard at Jun 10, 2009 at 11:58 pm
    What is mapred.child.ulimit set to? This configuration option specifies
    how much memory child processes are allowed to have. You may want to raise
    this limit and see what happens.
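
    For a streaming job you can also override it per job, something like the
    following (untested; as far as I recall the value is in kilobytes of
    virtual memory):

    # Untested sketch: raise the child ulimit for this job only.
    # mapred.child.ulimit is, I believe, specified in KB (4194304 KB = 4 GB).
    ./hadoop jar ../contrib/streaming/hadoop-0.19.1-streaming.jar \
        -jobconf mapred.child.ulimit=4194304 \
        -mapper "/usr/bin/perl /home/hadoop/scripts/map_parse_log_r2.pl" \
        -reducer "/usr/bin/perl /home/hadoop/scripts/reduce_parse_log.pl" \
        -input "/logs/*.log" -output test9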

    Let me know if that doesn't get you anywhere.

    Alex
  • Todd Lipcon at Jun 11, 2009 at 12:32 am
    Hey Scott,
    It turns out that Alex's answer was mistaken: your error is actually coming
    from a lack of disk space on the TaskTracker that has been assigned the
    reduce task. Specifically, there is not enough space in mapred.local.dir.
    You'll need to change mapred.local.dir to point to a partition that has
    enough space to hold the reduce's input.
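
    A quick way to sanity-check this (the path below is just a guess at the
    default, with hadoop.tmp.dir being /tmp/hadoop-${user.name}; use whatever
    your config actually says):

    # Compare the free space where mapred.local.dir lives against the
    # ~22 GB of reduce input the JobTracker expects.
    df -h /tmp/hadoop-hadoop/mapred/local
    # mapred.local.dir is set in conf/hadoop-site.xml on each TaskTracker
    # and can be a comma-separated list of directories on bigger partitions;
    # the TaskTrackers need a restart after changing it.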

    As for why this is the case, I hope someone will pipe up. It seems to me
    that reduce output should be able to go directly to the target filesystem
    without using space on mapred.local.dir.

    Thanks
    -Todd
  • Jason hadoop at Jun 11, 2009 at 6:13 am
    The data headed for the reduce may spill to disk during the sort: unless
    the machine/JVM has a huge amount of memory available, the sort will spill
    to disk, so the reduce's expected input has to fit in the partition's free
    space. If I did my math correctly, you are trying to push ~22 GB through a
    single reduce.

    As for the part-xxxxx files: if you have the number of reduces set to zero,
    you will get N part files, where N is the number of map tasks.

    If you absolutely must have it all go to one reduce, you will need to
    increase the free disk space. I think 0.19.1 preserves compression for the
    map output, so you could try enabling map output compression.

    If you have many nodes, you can set the number of reduces to some larger
    number and then use sort -m on the part files to merge-sort them, assuming
    your reduce preserves ordering.
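
    Something along these lines (untested; assumes the part files are
    individually sorted and fit on the local disk where you run the merge):

    # Pull the job output down from HDFS and merge-sort the part files
    # into one local file.
    ./hadoop fs -get test9 /tmp/test9
    sort -m /tmp/test9/part-* > /tmp/parsed_logs.txt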

    Try adding these parameters to your job line:
    -D mapred.compress.map.output=true -D mapred.output.compression.type=BLOCK

    BTW, /bin/cat works fine as an identity mapper or an identity reducer
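
    Putting it together, the job line might look roughly like this (untested;
    if this streaming version doesn't accept -D, -jobconf key=value does the
    same thing):

    # Same job with map output compression turned on; /bin/cat stands in as
    # the reducer here just to show where things go. Keep your Perl reducer
    # if you still need the key stripped off.
    ./hadoop jar ../contrib/streaming/hadoop-0.19.1-streaming.jar \
        -D mapred.compress.map.output=true \
        -D mapred.output.compression.type=BLOCK \
        -mapper "/usr/bin/perl /home/hadoop/scripts/map_parse_log_r2.pl" \
        -reducer /bin/cat \
        -input "/logs/*.log" -output test9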

    --
    Pro Hadoop, a book to guide you from beginner to hadoop mastery,
    http://www.apress.com/book/view/9781430219422
    www.prohadoopbook.com a community for Hadoop Professionals

Discussion Overview
group: common-user
categories: hadoop
posted: Jun 10, '09 at 4:40p
active: Jun 11, '09 at 6:13a
posts: 4
users: 4
website: hadoop.apache.org...
irc: #hadoop
