I've recently come across some interesting behavior within a 50-node
cluster regarding the tasktrackers and task attempts. Essentially,
tasks are being created but stick at 0.0%; the 'mapreduce.task.timeout'
doesn't seem to take effect, so they just sit there (for days if we let
them) and the jobs have to be killed. It's interesting to note that the
HDFS datanode service and HBase regionserver running on these nodes
work fine, and we've simply been shutting down the tasktracker service
on them to work around jobs stalling forever.



Some historical information: we're running Cloudera's cdh3u0 release,
and so far this has only happened on a handful of tasktracker nodes.
It seems to affect only nodes that were taken down for maintenance and
then brought back into the cluster (in one case a node was brought into
the cluster after it had been running elsewhere for a while, and we ran
into the same issue). After the nodes rejoin the cluster, the
tasktracker service starts getting these stalls. Note that this has not
happened to every node that was taken out of service for a time and
then re-added; I would say about a third of them have run into this
issue after maintenance. The maintenance problems on the affected nodes
were NOT the same (one had bad RAM, another a bad sector on a disk,
etc.): never the same initial problem, only the same outcome after
rejoining the cluster.



It's also never the same mapred job that sticks, nor is there any
evidence tying the stalls to a specific time of day. Rather, a node
will run fine for many jobs and then, all of a sudden, some tasks will
stall and stick at 0.0%. There are no visible errors in the log output,
yet nothing moves forward, and the stalled job won't release its
mappers for other jobs to use until it is killed. It seems that the
default 'mapreduce.task.timeout' just isn't working for some reason.
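
For what it's worth, we're relying on the default timeout rather than
setting it per job. My understanding (worth double-checking) is that on
our 0.20-based release the property is actually named
'mapred.task.timeout', with 'mapreduce.task.timeout' being the newer
name, so an explicit per-job override would look roughly like this:

    # Sketch only: 0.20-era property name; the value is the number of
    # milliseconds a task may go without reporting progress before the
    # tasktracker kills it (600000 = 10 minutes, the stock default).
    # Jar, class, and argument names are placeholders.
    hadoop jar <job.jar> <MainClass> -Dmapred.task.timeout=600000 <args>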



Has anyone come across anything similar to this? I can provide more
details/data as needed.



John Miller | Sr. Linux Systems Administrator
530 E. Liberty St.
Ann Arbor, MI 48104
Direct: 734.922.7007
http://mybuys.com/


  • Arun C Murthy at Dec 15, 2011 at 7:04 pm
    Hi John,

    It's hard for folks on this list to diagnose CDH (you might have to ask on their lists). However, I haven't seen similar issues with hadoop-0.20.2xx in a while.

    One thing to check would be to grab a stack trace (jstack) on the tasks to see what they are up to. Next, try to get a tcpdump to see if the tasks are indeed sending heartbeats to the TT, which might be the reason the TTs aren't timing them out.
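
    For example, something along these lines (pids, ports, and the exact user are placeholders; details depend on your setup):

        # find the child JVMs running task attempts on the node
        ps -ef | grep org.apache.hadoop.mapred.Child

        # dump the stacks of a suspect attempt (run as the JVM's owner)
        jstack <child-pid> > /tmp/task-<child-pid>.jstack

        # find the child's socket back to the TT, then watch that traffic
        netstat -anp | grep <child-pid>
        tcpdump -i lo -nn port <umbilical-port>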

    hth,
    Arun
  • John Miller at Dec 15, 2011 at 8:57 pm
    Hello Arun,



    Thanks for the quick reply. I totally understand the CDH issue, but I
    figured I'd ask the broader community as well in case there's a known
    upstream issue, since I've noticed some patches relating to "somewhat
    similar" problems.



    A jstack was already on my radar, but I hadn't thought about using
    tcpdump to catch whether the tasks are heartbeating, so thanks for the
    tip; I'll make sure to check that out. We're also planning our release
    update from CDH 3u0 to 3u2, which takes us from Hadoop 0.20.2+923.21
    to 0.20.2+923.142 and may fix the issue as a side effect; if it does,
    I'll let everyone here know.



    If anyone has further ideas, or has experienced a similar issue, my
    ears are open. Thanks again, Arun! :)



    John Miller | Sr. Linux Systems Administrator
    530 E. Liberty St.
    Ann Arbor, MI 48104
    Direct: 734.922.7007
    http://mybuys.com/



  • Rajesh balamohan at Dec 20, 2011 at 3:29 am
    Hi John,

    Which version of the JVM are you using (JDK 1.6.0_2x?), and what JVM
    arguments do you use for spawning the map/reduce slots?

    Check whether a JVM is stuck on the machine. Sometimes I have seen a
    task JVM get into a spinning mode right after launch and occupy 100%
    CPU.
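
    For example, a rough way to spot it (thread-id matching shown
    schematically):

        # spot java processes pinned near 100% CPU
        top -b -n 1 | grep java

        # per-thread view of a suspect JVM to find the spinning thread
        top -H -p <pid>

        # convert the hot thread id to hex; it matches the nid= field
        # in the jstack output for that JVM
        printf 'nid=0x%x\n' <tid>
        jstack <pid> | grep 'nid=0x<hex-tid>'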

    Can you check if this is the case?

    ~Rajesh Balamohan


  • Todd Lipcon at Dec 20, 2011 at 4:09 am

    Yep, this one that Rajesh mentions is a RHEL 6 bug:
    https://bugzilla.redhat.com/show_bug.cgi?id=750419

    We can reproduce it in our RHEL 6 QA clusters pretty reliably, but
    we're still working with Red Hat to reproduce and fix it.
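
    A quick way to see whether you're in that territory (standard tools
    only; nothing beyond procps is assumed):

        # confirm the kernel and OS release first
        uname -r; cat /etc/redhat-release

        # for a JVM that won't even answer jstack, see where its threads
        # are parked in the kernel (the wchan column)
        ps -eLo pid,tid,pcpu,wchan:30,comm | grep java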

    Thanks
    -Todd
    --
    Todd Lipcon
    Software Engineer, Cloudera
  • John Miller at Dec 20, 2011 at 6:58 pm
    We're running jdk1.6.0_26 at this time, and below is what we're
    calling the jobs with. If by "spawning the map/reduce slots" you mean
    the arguments our internal classes use, I'd have to go back to the dev
    team and do some more digging.



    $HADOOP_HOME/bin/hadoop jar ${JAR_FILE} com.our.private.class ${D_OPTS} \
        '-Dmapred.child.java.opts=-Xmx2g -server -Xss128k' \
        -Dmapred.reduce.tasks=${REDUCER_COUNT} -libjars ${LIB_JARS} \
        ${SOURCE} ${OUTPUT_DIR} ${HDFS_PREFIX} ${HBASE_PREFIX}



    I don't recall seeing the 100% CPU issue with the child JVMs when this
    happens. Unfortunately, I'm also not able to try to replicate it until
    after the holidays (it's our busy season and we're in a holding
    pattern).



    John Miller | Sr. Linux Systems Administrator
    530 E. Liberty St.
    Ann Arbor, MI 48104
    Direct: 734.922.7007
    http://mybuys.com/



  • John Miller at Dec 28, 2011 at 7:09 pm
    Here's a jstack from the jobtracker for an instance of this issue we
    hit today. Unfortunately, every task inside each map job shows "no
    task attempts found", and there are multiple jobs stuck at 89.49% and
    a couple at 99.9% mapping. Nothing has made progress for hours, and
    any new jobs submitted get stuck the same way as the others. I was
    unable to grab a tcpdump of the tasktrackers' heartbeats this time,
    since there are no child VMs for them.
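
    For reference, capturing it looked roughly like this (jps and jstack
    ship with the JDK; the output path is just what I picked):

        # find the JobTracker pid and dump its thread stacks
        JT_PID=$(jps -l | awk '/JobTracker/ {print $1}')
        jstack "$JT_PID" > /tmp/jobtracker.jstack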



    John Miller | Sr. Linux Systems Administrator
    530 E. Liberty St.
    Ann Arbor, MI 48104
    Direct: 734.922.7007
    http://mybuys.com/



  • John Miller at Dec 28, 2011 at 9:37 pm
    Actually, we found the cause of the issue that occurred today: a rogue
    HBase regionserver had been started and joined the cluster, which
    caused a whole bunch of problems. After removing it, the cluster is
    working properly.



    Unfortunately, this is not the same issue we've been seeing in the
    past that originally prompted this thread; it arose specifically today
    and was incorrectly tied to the earlier stalls, which will still need
    more investigation when they come up again.



    John Miller | Sr. Linux Systems Administrator
    530 E. Liberty St.
    Ann Arbor, MI 48104
    Direct: 734.922.7007
    http://mybuys.com/



