FAQ
1. Do the reducers of a job start only after all mappers have finished?

2. Say there are 10 slave nodes. Let us say one of the nodes is very
slow as compared to other nodes. So, while the mappers in the other 9
have finished in 2 minutes, the one on the slow one might take 20
minutes. Is Hadoop intelligent enough to redistribute the key-value
pairs assigned to this slow node to the free nodes and start new
mappers on them?

3. Is the above true for reducers also?

4. Is it possible to run more than one mapper or one reducer per slave
node? If yes, can the number of mappers per node or number of reducers
per node be set anywhere in the conf files?


  • Todd Lipcon at May 6, 2009 at 8:22 pm

    On Wed, May 6, 2009 at 12:22 PM, Foss User wrote:

    1. Do the reducers of a job start only after all mappers have finished?
    The reducer tasks start so they can begin copying map output, but your
    actual reduce function does not. This is because it doesn't know that the
    data for any given key has been completely generated by the map stage until
    the map stage is complete.
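    If the early scheduling of reduce tasks ties up slots, the point at which
    they launch can be tuned. In 0.19/0.20-era releases the relevant property
    is mapred.reduce.slowstart.completed.maps (check your version); a sketch of
    the fragment, with the illustrative default value:

```xml
<!-- mapred-site.xml: fraction of map tasks that must complete before
     reduce tasks are scheduled to begin copying map output -->
<property>
  <name>mapred.reduce.slowstart.completed.maps</name>
  <value>0.05</value>
</property>
```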

    2. Say there are 10 slave nodes. Let us say one of the nodes is very
    slow as compared to other nodes. So, while the mappers in the other 9
    have finished in 2 minutes, the one on the slow one might take 20
    minutes. Is Hadoop intelligent enough to redistribute the key-value
    pairs assigned to this slow node to the free nodes and start new
    mappers on them?
    No. However, if you enable "speculative execution", it will schedule a
    second invocation of slow map tasks. This is helpful if the slowness is due
    to node-specific issues (e.g. a disk going bad or external load for whatever
    reason) but doesn't help at all if the data is intrinsically slow to
    process.

    3. Is the above true for reducers also?
    Speculative execution can be enabled separately for the mappers and
    reducers.

    The relevant configuration variables are
    mapred.map.tasks.speculative.execution and
    mapred.reduce.tasks.speculative.execution
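
    For reference, a minimal mapred-site.xml fragment toggling the two
    properties Todd names (the values here are illustrative; both default
    to true in releases of this era):

```xml
<!-- mapred-site.xml: enable/disable speculative execution per phase -->
<property>
  <name>mapred.map.tasks.speculative.execution</name>
  <value>true</value>
</property>
<property>
  <name>mapred.reduce.tasks.speculative.execution</name>
  <value>false</value>
</property>
```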

    4. Is it possible to run more than one mapper or one reducer per slave
    node? If yes, can the number of mappers per node or number of reducers
    per node be set anywhere in the conf files?
    Yes. See mapred.tasktracker.map.tasks.maximum and
    mapred.tasktracker.reduce.tasks.maximum
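
    These are per-tasktracker settings, so they go in the configuration on
    each slave node. A sketch, with slot counts that are purely illustrative
    (tune them to the node's cores and memory):

```xml
<!-- mapred-site.xml on each tasktracker: task slots per node -->
<property>
  <name>mapred.tasktracker.map.tasks.maximum</name>
  <value>2</value>
</property>
<property>
  <name>mapred.tasktracker.reduce.tasks.maximum</name>
  <value>2</value>
</property>
```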

    -Todd
  • Foss User at May 6, 2009 at 8:47 pm
    Thanks for your response. I got a few more questions regarding optimizations.

    1. Do Hadoop clients locally cache the data they last requested?

    2. Is the meta data for file blocks on data node kept in the
    underlying OS's file system on namenode or is it kept in RAM of the
    name node?

    3. If no more mapper functions can be run on the node that
    contains the data on which the mapper has to act, is Hadoop
    intelligent enough to run the new mappers on some machines within the
    same rack?

    4. When can a case like the above happen? I mean, when can it happen
    that the maximum number of mappers configured for a tasktracker has
    been reached but Hadoop still needs to start more mappers?

    5. Are the multiple mappers and reducers run as separate threads
    within the same TaskTracker process?
  • Todd Lipcon at May 6, 2009 at 9:01 pm

    On Wed, May 6, 2009 at 1:46 PM, Foss User wrote:

    Thanks for your response. I got a few more questions regarding
    optimizations.

    1. Do Hadoop clients locally cache the data they last requested?
    I don't know the DFS read path very well, but I don't believe there is any
    built in cache here. There is an undocumented configuration variable
    dfs.read.prefetch.size which affects DFSClient's prefetching of data ahead
    of the current file position, but I don't want to give any answer I'm not
    certain of. Hopefully someone else will chime in here.

    I will answer that there is no *large* cache of data locally. HDFS is
    optimized for sequential reads, where a cache is generally useless if not
    detrimental.

    2. Is the meta data for file blocks on data node kept in the
    underlying OS's file system on namenode or is it kept in RAM of the
    name node?
    The block locations are kept in the RAM of the name node, and are updated
    whenever a Datanode does a "block report". This is why the namenode is in
    "safe mode" at startup until it has received block locations for some
    configurable percentage of blocks from the datanodes.
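
    The percentage Todd mentions is itself configurable; in 0.20-era releases
    the property is dfs.safemode.threshold.pct (an assumption about your
    version), shown here with its illustrative default:

```xml
<!-- hdfs-site.xml: fraction of blocks that must be reported before
     the namenode automatically leaves safe mode -->
<property>
  <name>dfs.safemode.threshold.pct</name>
  <value>0.999</value>
</property>
```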

    3. If no more mapper functions can be run on the node that
    contains the data on which the mapper has to act, is Hadoop
    intelligent enough to run the new mappers on some machines within the
    same rack?
    Yes, assuming you have configured a network topology script. Otherwise,
    Hadoop has no magical knowledge of your network infrastructure, and it
    treats the whole cluster as a single rack called /default-rack

    4. When can a case like the above happen? I mean, when can it happen
    that the maximum number of mappers configured for a tasktracker has
    been reached but Hadoop still needs to start more mappers?
    If you have a file with 100 blocks all on the same three nodes, but a
    six-node cluster, it will schedule some tasks on nodes that do not contain
    the blocks, since it would rather keep the cluster utilized than keep all
    data access local.

    5. Are the multiple mappers and reducers run as separate threads
    within the same TaskTracker process?
    No, they are run as child processes.

    -Todd
  • Foss User at May 7, 2009 at 5:06 am
    Thanks for your response again. I could not understand a few things in
    your reply. So, I want to clarify them. Please find my questions
    inline.
    On Thu, May 7, 2009 at 2:28 AM, Todd Lipcon wrote:
    On Wed, May 6, 2009 at 1:46 PM, Foss User wrote:
    2. Is the meta data for file blocks on data node kept in the
    underlying OS's file system on namenode or is it kept in RAM of the
    name node?
    The block locations are kept in the RAM of the name node, and are updated
    whenever a Datanode does a "block report". This is why the namenode is in
    "safe mode" at startup until it has received block locations for some
    configurable percentage of blocks from the datanodes.
    What is "safe mode" in namenode? This concept is new to me. Could you
    please explain this?
    3. If no more mapper functions can be run on the node that
    contains the data on which the mapper has to act, is Hadoop
    intelligent enough to run the new mappers on some machines within the
    same rack?
    Yes, assuming you have configured a network topology script. Otherwise,
    Hadoop has no magical knowledge of your network infrastructure, and it
    treats the whole cluster as a single rack called /default-rack
    Is it a network topology script or is it a Java plugin code? AFAIK, we
    need to write an implementation of
    org.apache.hadoop.net.DNSToSwitchMapping interface. Can we write it as
    a script or configuration file and avoid Java coding to achieve this?
    If so, how?
  • Tom White at May 7, 2009 at 8:32 am

    On Thu, May 7, 2009 at 6:05 AM, Foss User wrote:
    Thanks for your response again. I could not understand a few things in
    your reply. So, I want to clarify them. Please find my questions
    inline.
    On Thu, May 7, 2009 at 2:28 AM, Todd Lipcon wrote:
    On Wed, May 6, 2009 at 1:46 PM, Foss User wrote:
    2. Is the meta data for file blocks on data node kept in the
    underlying OS's file system on namenode or is it kept in RAM of the
    name node?
    The block locations are kept in the RAM of the name node, and are updated
    whenever a Datanode does a "block report". This is why the namenode is in
    "safe mode" at startup until it has received block locations for some
    configurable percentage of blocks from the datanodes.
    What is "safe mode" in namenode? This concept is new to me. Could you
    please explain this?
    Safe mode is described here:
    http://hadoop.apache.org/core/docs/r0.20.0/hdfs_design.html#Safemode
    3. If no more mapper functions can be run on the node that
    contains the data on which the mapper has to act, is Hadoop
    intelligent enough to run the new mappers on some machines within the
    same rack?
    Yes, assuming you have configured a network topology script. Otherwise,
    Hadoop has no magical knowledge of your network infrastructure, and it
    treats the whole cluster as a single rack called /default-rack
    Is it a network topology script or is it a Java plugin code? AFAIK, we
    need to write an implementation of
    org.apache.hadoop.net.DNSToSwitchMapping interface. Can we write it as
    a script or configuration file and avoid Java coding to achieve this?
    If so, how?
    To tell Hadoop about your network topology you can either write a Java
    implementation of org.apache.hadoop.net.DNSToSwitchMapping or you can
    write a script in another language. There are more details at
    http://hadoop.apache.org/core/docs/r0.20.0/cluster_setup.html#Hadoop+Rack+Awareness
    and a sample script at
    http://www.nabble.com/Hadoop-topology.script.file.name-Form-td17683521.html
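
    For illustration, a minimal script of the kind Tom links to. Hadoop
    invokes the script named by topology.script.file.name with one or more
    IPs or hostnames as arguments and reads one rack path per argument from
    stdout; the subnets and rack names below are hypothetical and must be
    adapted to your own network.

```shell
#!/bin/sh
# Hypothetical topology script: maps each argument (IP or hostname)
# to a rack path, one line of output per argument.
map_rack() {
  for host in "$@"; do
    case "$host" in
      10.1.1.*) echo "/dc1/rack1" ;;
      10.1.2.*) echo "/dc1/rack2" ;;
      *)        echo "/default-rack" ;;
    esac
  done
}

map_rack "$@"
```

    Any host the script cannot place falls through to /default-rack, which
    matches what Hadoop assumes when no script is configured at all.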

Discussion Overview
group: common-user @
categories: hadoop
posted: May 6, '09 at 7:22p
active: May 7, '09 at 8:32a
posts: 6
users: 3
website: hadoop.apache.org...
irc: #hadoop
