On a ~100 node cluster running HDFS (we just use HDFS + fuse, no job/
task trackers) I've noticed many datanodes get 'stuck'. The nodes
themselves seem fine with no network/memory problems, but in every
instance I see two DataNode processes running, and the NameNode logs
indicate the datanode in question simply stopped responding. This
state persists until I come along and kill the DataNode processes and
restart the DataNode on that particular machine.

I'm at a loss as to why this is happening, so here's all the relevant
information I can think of sharing:

hadoop version = 0.19.1-dev, r (we possibly have some custom patches
running, but nothing which would affect HDFS that I'm aware of)
number of nodes = ~100
HDFS size = ~230TB
Java version =
OS = CentOS 4.7 x86_64, 4/8 core Opterons with 4GB/16GB of memory
respectively

I managed to grab a stack dump via "kill -3" from two of these problem
instances and threw the logs up at http://cse.unl.edu/~attebury/datanode_problem/.
The .log files honestly show nothing out of the ordinary, and having
very little Java development experience, the .out files mean nothing to
me. It's also worth mentioning that the NameNode logs at the time when
these DataNodes got stuck show nothing out of the ordinary either --
just the expected "lost heartbeat from node <x>" message. The DataNode
daemon (the original process, not the second mysterious one) continues
to respond to web requests like browsing the log directory during this
time.

Whenever this happens I've just manually done a "kill -9" to remove
the two stuck DataNode processes (I'm not even sure why there's two of
them, as under normal operation there's only one). After killing the
stuck ones, I simply do a "hadoop-daemon.sh start datanode" and all is
normal again. I've not seen any data loss or corruption as a result of
this problem.

Has anyone seen anything like this happen before? Out of our ~100 node
cluster I see this problem around once a day, and it seems to just
strike random nodes at random times. It happens often enough that I
would be happy to do additional debugging if anyone can tell me how.
I'm not a developer at all, so I'm at the end of my knowledge on how
to solve this problem. Thanks for any help!


===============================
Garhan Attebury
Systems Administrator
UNL Research Computing Facility
402-472-7761
===============================

  • Philip Zeyliger at Mar 9, 2009 at 5:40 pm
    Very naively looking at the stack traces, a common theme is that there's a
    call out to "df" to find the system capacity. If you see two data node
    processes, perhaps the fork/exec to call out to "df" is failing in some
    strange way.

    "DataNode: [/hadoop-data/dfs/data]" daemon prio=10
    tid=0x0000002ae2c0d400 nid=0x21cf in Object.wait()
    [0x0000000042c54000..0x0000000042c54b30]
    java.lang.Thread.State: WAITING (on object monitor)
    at java.lang.Object.wait(Native Method)
    at java.lang.Object.wait(Object.java:485)
    at java.lang.UNIXProcess$Gate.waitForExit(UNIXProcess.java:64)
    - locked <0x0000002a9fd84f98> (a java.lang.UNIXProcess$Gate)
    at java.lang.UNIXProcess.<init>(UNIXProcess.java:145)
    at java.lang.ProcessImpl.start(ProcessImpl.java:65)
    at java.lang.ProcessBuilder.start(ProcessBuilder.java:452)
    at org.apache.hadoop.util.Shell.runCommand(Shell.java:149)
    at org.apache.hadoop.util.Shell.run(Shell.java:134)
    at org.apache.hadoop.fs.DF.getCapacity(DF.java:63)
    at org.apache.hadoop.hdfs.server.datanode.FSDataset$FSVolume.getCapacity(FSDataset.java:341)
    at org.apache.hadoop.hdfs.server.datanode.FSDataset$FSVolumeSet.getCapacity(FSDataset.java:501)
    - locked <0x0000002a9ed97078> (a
    org.apache.hadoop.hdfs.server.datanode.FSDataset$FSVolumeSet)
    at org.apache.hadoop.hdfs.server.datanode.FSDataset.getCapacity(FSDataset.java:697)
    at org.apache.hadoop.hdfs.server.datanode.DataNode.offerService(DataNode.java:671)
    at org.apache.hadoop.hdfs.server.datanode.DataNode.run(DataNode.java:1105)
    at java.lang.Thread.run(Thread.java:619)
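
    The trace itself shows the mechanism in question: DF.getCapacity() does not
    read volume sizes through a library call, it goes through Shell.runCommand(),
    which forks and execs the external df binary via ProcessBuilder on the
    DataNode's capacity-reporting path. The snippet below is only a rough,
    hedged stand-in for that shell-out, for anyone who wants to poke at it in
    isolation -- it is not the actual Hadoop DF source, and the class name and
    parsing are invented for the example:

        import java.io.BufferedReader;
        import java.io.IOException;
        import java.io.InputStreamReader;

        // Rough stand-in for the df shell-out shown in the trace above:
        // fork/exec "df -k <dir>" and read the capacity column of its output.
        public class DfProbe {
            public static long capacityKilobytes(String dir)
                    throws IOException, InterruptedException {
                ProcessBuilder pb = new ProcessBuilder("df", "-k", dir);
                Process p = pb.start();           // the fork/exec that appears to hang
                BufferedReader r = new BufferedReader(
                        new InputStreamReader(p.getInputStream()));
                r.readLine();                     // skip the df header line
                String[] fields = r.readLine().trim().split("\\s+");
                p.waitFor();
                return Long.parseLong(fields[1]); // second column: total 1K-blocks
            }

            public static void main(String[] args) throws Exception {
                System.out.println(capacityKilobytes(args.length > 0 ? args[0] : "/"));
            }
        }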


  • Steve Loughran at Mar 9, 2009 at 5:49 pm

    Philip Zeyliger wrote:
    Very naively looking at the stack traces, a common theme is that there's a
    call out to "df" to find the system capacity. If you see two data node
    processes, perhaps the fork/exec to call out to "df" is failing in some
    strange way.

    That's deep into Java code. OpenJDK gives you more of that source. One
    option here is to consider some kind of timeout in the exec, but it's
    pretty tricky to tack that on around the Java runtime APIs, because the
    process APIs weren't really designed to be interrupted by other threads.

    -steve
    "DataNode: [/hadoop-data/dfs/data]" daemon prio=10
    tid=0x0000002ae2c0d400 nid=0x21cf in Object.wait()
    [0x0000000042c54000..0x0000000042c54b30]
    java.lang.Thread.State: WAITING (on object monitor)
    at java.lang.Object.wait(Native Method)
    at java.lang.Object.wait(Object.java:485)
    at java.lang.UNIXProcess$Gate.waitForExit(UNIXProcess.java:64)
    - locked <0x0000002a9fd84f98> (a java.lang.UNIXProcess$Gate)
    at java.lang.UNIXProcess.<init>(UNIXProcess.java:145)
    at java.lang.ProcessImpl.start(ProcessImpl.java:65)
    at java.lang.ProcessBuilder.start(ProcessBuilder.java:452)
    at org.apache.hadoop.util.Shell.runCommand(Shell.java:149)
    at org.apache.hadoop.util.Shell.run(Shell.java:134)
    at org.apache.hadoop.fs.DF.getCapacity(DF.java:63)
    at org.apache.hadoop.hdfs.server.datanode.FSDataset$FSVolume.getCapacity(FSDataset.java:341)
    at org.apache.hadoop.hdfs.server.datanode.FSDataset$FSVolumeSet.getCapacity(FSDataset.java:501)
    - locked <0x0000002a9ed97078> (a
    org.apache.hadoop.hdfs.server.datanode.FSDataset$FSVolumeSet)
    at org.apache.hadoop.hdfs.server.datanode.FSDataset.getCapacity(FSDataset.java:697)
    at org.apache.hadoop.hdfs.server.datanode.DataNode.offerService(DataNode.java:671)
    at org.apache.hadoop.hdfs.server.datanode.DataNode.run(DataNode.java:1105)
    at java.lang.Thread.run(Thread.java:619)


    On Mon, Mar 9, 2009 at 8:17 AM, Garhan Attebury wrote:

    On a ~100 node cluster running HDFS (we just use HDFS + fuse, no job/task
    trackers) I've noticed many datanodes get 'stuck'. The nodes themselves seem
    fine with no network/memory problems, but in every instance I see two
    DataNode processes running, and the NameNode logs indicate the datanode in
    question simply stopped responding. This state persists until I come along
    and kill the DataNode processes and restart the DataNode on that particular
    machine.

    I'm at a loss as to why this is happening, so here's all the relevant
    information I can think of sharing:

    hadoop version = 0.19.1-dev, r (we possibly have some custom patches
    running, but nothing which would affect HDFS that I'm aware of)
    number of nodes = ~100
    HDFS size = ~230TB
    Java version =
    OS = CentOS 4.7 x86_64, 4/8 core Opterons with 4GB/16GB of memory
    respectively

    I managed to grab a stack dump via "kill -3" from two of these problem
    instances and threw up the logs at
    http://cse.unl.edu/~attebury/datanode_problem/<http://cse.unl.edu/%7Eattebury/datanode_problem/>.
    The .log files honestly show nothing out of the ordinary, and having very
    little Java developing experience the .out files mean nothing to me. It's
    also worth mentioning that the NameNode logs at the time when these
    DataNodes got stuck show nothing out of the ordinary either -- just the
    expected "lost heartbeat from node <x>" message. The DataNode daemon (the
    original process, not the second mysterious one) continues to respond to web
    requests like browsing the log directory during this time.

    Whenever this happens I've just manually done a "kill -9" to remove the two
    stuck DataNode processes (I'm not even sure why there's two of them, as
    under normal operation there's only one). After killing the stuck ones, I
    simply do a "hadoop-daemon.sh start datanode" and all is normal again. I've
    not seen any dataloss or corruption as a result of this problem.

    Has anyone seen anything like this happen before? Out of our ~100 node
    cluster I see this problem around once a day, and it seems to just strike
    random nodes at random times. It happens often enough that I would be happy
    to do additional debugging if anyone can tell me how. I'm not a developer at
    all, so I'm at the end of my knowledge on how to solve this problem. Thanks
    for any help!


    ===============================
    Garhan Attebury
    Systems Administrator
    UNL Research Computing Facility
    402-472-7761
    ===============================

    --
    Steve Loughran http://www.1060.org/blogxter/publish/5
    Author: Ant in Action http://antbook.org/
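
    To make the caveat above concrete: on the Java 5/6 APIs the only way to bound
    a shell-out is to do the starting and waiting on a helper thread and destroy
    the child if the wait runs long, roughly as sketched below. This is a hedged
    illustration, not Hadoop code; the class and method names are invented:

        import java.util.concurrent.*;
        import java.util.concurrent.atomic.AtomicReference;

        // One possible timeout wrapper around fork/exec on Java 5/6, which have
        // no Process.waitFor(timeout): start and reap the child on a pool thread
        // and give up from the caller's side after a deadline.
        public class TimedExec {
            private static final ExecutorService REAPER = Executors.newCachedThreadPool();

            public static int runWithTimeout(final ProcessBuilder pb, long timeoutSeconds)
                    throws Exception {
                final AtomicReference<Process> child = new AtomicReference<Process>();
                Future<Integer> exit = REAPER.submit(new Callable<Integer>() {
                    public Integer call() throws Exception {
                        Process p = pb.start();   // fork/exec happens on the helper thread
                        child.set(p);
                        return p.waitFor();
                    }
                });
                try {
                    return exit.get(timeoutSeconds, TimeUnit.SECONDS);
                } catch (TimeoutException e) {
                    exit.cancel(true);            // interrupt the helper thread
                    Process p = child.get();
                    if (p != null) {
                        p.destroy();              // kill the child if it was ever created
                    }
                    throw e;
                }
            }
        }

    The limitation is exactly the trickiness Steve describes: in the traces above
    the thread is blocked inside ProcessBuilder.start() itself, before a Process
    object even exists, so on a timeout the helper thread simply stays stuck and
    there is nothing to destroy.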
  • Brian Bockelman at Mar 9, 2009 at 6:24 pm
    It's very strange. It appears that the second process is the result
    of a fork call, yet has only one thread running whose gdb backtrace
    looks like this:

    (gdb) bt
    #0 0x0000003e10c0af8b in __lll_mutex_lock_wait () from /lib64/tls/
    libpthread.so.0
    #1 0x0000000000000000 in ?? ()

    Not very helpful! I'd normally suspect some strange memory issue, but
    I've checked - there was plenty of memory available on the host when
    the second process was spawned and we weren't close to the file
    descriptor limit.


    Looking at this issue,
    https://issues.apache.org/jira/browse/HADOOP-2231

    it seems that the "df" call is avoidable now that we're in Java 1.6.
    However, the issue was closed and marked as a duplicate, but without
    noting what it was a duplicate of (grrr). Is there an updated version
    of that patch?

    Brian
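
    For reference, the Java 6 replacement presumably meant here is the set of
    volume-size accessors added to java.io.File, which need no child process at
    all. A minimal sketch (the class name is invented; /hadoop-data/dfs/data is
    just the volume path from the stack traces):

        import java.io.File;

        // Read volume sizes directly from the JVM (Java 6+): no df, no fork/exec.
        public class VolumeStats {
            public static void main(String[] args) {
                File vol = new File(args.length > 0 ? args[0] : "/hadoop-data/dfs/data");
                System.out.println("capacity bytes: " + vol.getTotalSpace());
                System.out.println("usable bytes:   " + vol.getUsableSpace());
                System.out.println("free bytes:     " + vol.getFreeSpace());
            }
        }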
  • Jason hadoop at Mar 9, 2009 at 7:21 pm
    There were a couple of fork timing errors in JDK 1.5 that occasionally
    caused a subprocess fork to go bad; this could be the du/df being forked
    off by the datanode and dying.

    I can't find the references I had saved away at one point, from the java
    forums, but perhaps this will get you started.

    http://forums.sun.com/thread.jspa?threadID=5297465&tstart=0
    --
    Alpha Chapters of my book on Hadoop are available
    http://www.apress.com/book/view/9781430219422
  • Brian Bockelman at Mar 9, 2009 at 7:45 pm
    Hey Jason,

    Looks like you're on the right track. We are getting stuck after the
    JVM forks but before it exec's. It appears to be this bug:

    http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=6671051

    However, this is using Java 1.6.0_11, and that bug was marked as fixed
    in 1.6.0_6 :(

    Any other ideas?

    Brian
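
    One mundane thing worth double-checking while chasing a version-specific JDK
    bug: which JVM the DataNode is actually running under on each node, since the
    "Java version =" line above was left blank and the java on a node's PATH is
    not always the JAVA_HOME the daemon was started with. A trivial sketch; the
    class name is invented:

        // Print the version properties of whatever JVM runs this class.
        public class JvmReport {
            public static void main(String[] args) {
                System.out.println("java.version         = " + System.getProperty("java.version"));
                System.out.println("java.runtime.version = " + System.getProperty("java.runtime.version"));
                System.out.println("java.vm.name         = " + System.getProperty("java.vm.name"));
            }
        }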
  • Jason hadoop at Mar 9, 2009 at 8:22 pm
    I think the bug probably still exists and the fix only covers one trigger.
    --
    Alpha Chapters of my book on Hadoop are available
    http://www.apress.com/book/view/9781430219422
