Hello Hadoopers-
I'm attempting to run some large-memory map tasks using hadoop
streaming, but I seem to be running afoul of the mapred.child.ulimit
restriction, which is set to 2097152. I assume this is in KB since my
tasks fail when they get to about 2GB (I just need to get to about
2.3GB- almost there!). So far, nothing I've tried has succeeded in
changing this value. I've attempted to add
-jobconf mapred.child.ulimit=3000000
to the streaming command line, but to no avail. In the job's xml file
that I find in my logs, it's still got the old value. And worse, in
my task logs I see the message:
"attempt to override final parameter: mapred.child.ulimit; Ignoring."
which doesn't exactly inspire confidence that I'm on the right path.
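
For reference, the full command line I'm running is along these lines (the
streaming jar path, mapper script, and HDFS paths here are stand-ins for my
real job):

hadoop jar $HADOOP_HOME/contrib/streaming/hadoop-0.18.3-streaming.jar \
  -input /user/chris/input \
  -output /user/chris/output \
  -mapper my_big_memory_mapper \
  -reducer NONE \
  -jobconf mapred.child.ulimit=3000000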

I see there's been a fair amount of traffic on Jira about large memory
jobs, but there doesn't seem to be much in the way of examples or
documentation. Can someone tell me how to run such a job, especially
a streaming job?

Many thanks in advance--
Chris
ps. I'm running an 18.3 cluster on Amazon EC2 (I've been using the
Cloudera convenience scripts, but I can abandon this if I need more
control). The instances have plenty of memory (7.5GB).

  • Allen Wittenauer at Sep 15, 2009 at 5:20 pm

    On 9/14/09 10:42 PM, "Chris Dyer" wrote:
    > And worse, in
    > my task logs I see the message:
    > "attempt to override final parameter: mapred.child.ulimit; Ignoring."
    > which doesn't exactly inspire confidence that I'm on the right path.
    Chances are the param has been marked final in the task tracker's running
    config, which will prevent you from overriding the value with a job-specific
    configuration.
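
    If that's the case, the tasktracker's hadoop-site.xml (or wherever your
    cluster scripts put the daemon-side config) will have an entry along these
    lines; the <final>true</final> element is what produces that warning, and
    the value shown is just your current 2097152:

    <property>
      <name>mapred.child.ulimit</name>
      <value>2097152</value>
      <final>true</final>
    </property>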
    > ps. I'm running an 18.3 cluster on Amazon EC2 (I've been using the
    > Cloudera convenience scripts, but I can abandon this if I need more
    > control). The instances have plenty of memory (7.5GB).
    Depending upon how many tasks per node, that may not be enough. Streaming
    jobs eat a crapton (I'm pretty sure that is an SI unit) of memory. If you
    are hitting 2gb+, that means you can probably run 3 tasks max without
    swapping. [Don't forget to count the size of the task tracker JVM, the
    streaming.jar JVM, etc, and be cognizant of the fact that JVM mem size !=
    Java heap size.]
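    (Rough arithmetic: 3 x 2.3GB is already about 6.9GB for the external
    processes alone, which on a 7.5GB instance leaves only a few hundred MB
    for the tasktracker JVM and the per-task child JVMs.)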
  • Chris Dyer at Sep 15, 2009 at 5:34 pm

    >> my task logs I see the message:
    >> "attempt to override final parameter: mapred.child.ulimit; Ignoring."
    >> which doesn't exactly inspire confidence that I'm on the right path.
    > Chances are the param has been marked final in the task tracker's running
    > config, which will prevent you from overriding the value with a job-specific
    > configuration.
    Do you have any idea how one unmarks such a thing? Do I just need to
    edit the configuration file for the task tracker?
    > Depending upon how many tasks per node, that may not be enough. Streaming
    > jobs eat a crapton (I'm pretty sure that is an SI unit) of memory. If you
    Is there any particular reason for the excessive memory use? I
    realize this is Java, but it's just sloshing data down to my
    processes...
    > are hitting 2gb+, that means you can probably run 3 tasks max without
    > swapping. [Don't forget to count the size of the task tracker JVM, the
    > streaming.jar JVM, etc, and be cognizant of the fact that JVM mem size !=
    > Java heap size.]
    I'm seeing the failures even when I run a single job. But, obviously
    I don't want to schedule more than 3 jobs on a node since they won't
    have enough memory. How does one change the number of map slots per
    node? I'm a hadoop configuration newbie (which is why I was
    originally excited about the Cloudera EC2 scripts...)

    -Chris
  • Steve Loughran at Sep 16, 2009 at 10:01 am

    Chris Dyer wrote:
    >>> my task logs I see the message:
    >>> "attempt to override final parameter: mapred.child.ulimit; Ignoring."
    >>> which doesn't exactly inspire confidence that I'm on the right path.
    >> Chances are the param has been marked final in the task tracker's running
    >> config, which will prevent you from overriding the value with a job-specific
    >> configuration.
    > Do you have any idea how one unmarks such a thing? Do I just need to
    > edit the configuration file for the task tracker?
    >> Depending upon how many tasks per node, that may not be enough. Streaming
    >> jobs eat a crapton (I'm pretty sure that is an SI unit) of memory. If you
    > Is there any particular reason for the excessive memory use? I
    > realize this is Java, but it's just sloshing data down to my
    > processes...
    Java 6u14+ lets you run with "compressed pointers"; everyone is still
    playing with that, but it does appear to reduce 64-bit memory use. If
    you were using 32-bit JVMs, stay with them, as even with compressed
    pointers, 64-bit JVMs use more memory per object instance.
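
    If you want to experiment with that, JVM flags for the child tasks go
    through mapred.child.java.opts; purely as a sketch (the heap size here is
    arbitrary, and whether it helps much for a streaming child is untested):

    -jobconf mapred.child.java.opts="-Xmx512m -XX:+UseCompressedOops"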
    > How does one change the number of map slots per
    > node? I'm a hadoop configuration newbie (which is why I was
    > originally excited about the Cloudera EC2 scripts...)
    From the code in front of my IDE

    maxMapSlots = conf.getInt("mapred.tasktracker.map.tasks.maximum", 2);
    maxReduceSlots = conf.getInt("mapred.tasktracker.reduce.tasks.maximum", 2);

    Those are conf values you have to tune.
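
    Concretely, that would be something like the following in each
    tasktracker's hadoop-site.xml, plus a tasktracker restart, since these
    are read at daemon startup rather than per job. The numbers are just an
    example; pick values that leave room for everything Allen listed:

    <property>
      <name>mapred.tasktracker.map.tasks.maximum</name>
      <value>3</value>
    </property>
    <property>
      <name>mapred.tasktracker.reduce.tasks.maximum</name>
      <value>1</value>
    </property>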
