Task process exit with nonzero status of 1 - deleting userlogs helps
Hi,

I am running a 4-node cluster with hadoop-0.20.2. Suddenly I have run into a situation where every task scheduled on 2 of the 4 nodes fails.
It seems like the child JVM crashes. There are no child logs under logs/userlogs. The TaskTracker gives this:

2010-06-14 09:34:12,714 INFO org.apache.hadoop.mapred.JvmManager: In JvmRunner constructed JVM ID: jvm_201006091425_0049_m_-946174604
2010-06-14 09:34:12,714 INFO org.apache.hadoop.mapred.JvmManager: JVM Runner jvm_201006091425_0049_m_-946174604 spawned.
2010-06-14 09:34:12,727 INFO org.apache.hadoop.mapred.JvmManager: JVM : jvm_201006091425_0049_m_-946174604 exited. Number of tasks it ran: 0
2010-06-14 09:34:12,727 WARN org.apache.hadoop.mapred.TaskRunner: attempt_201006091425_0049_m_003179_0 Child Error
java.io.IOException: Task process exit with nonzero status of 1.
at org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:418)


At some point I simply renamed logs/userlogs to logs/userlogsOLD. A new job created logs/userlogs again, and no error occurred anymore on this host.
The permissions of userlogs and userlogsOLD are exactly the same. userlogsOLD contains about 378M in 132747 files. When I copy the content of userlogsOLD back into userlogs, the tasks on that node start failing again.

Some questions:
- this looks to me like a problem with too many files in one folder - any thoughts on this?
- is the content of logs/userlogs cleaned up by Hadoop regularly?
- the logs/stdout files of the tasks do not exist, and the TaskTracker's log files have no specific message (other than the one posted above) - is there any other log file where an error message could be found?


best regards
Johannes

  • Edward Capriolo at Jun 14, 2010 at 5:48 pm

    Most file systems have an upper limit on the number of files and
    subdirectories per directory. You have probably hit the ext3 limit: ext3
    allows at most 32,000 hard links per inode, so a directory can hold at
    most 31,998 subdirectories. If you launch lots and lots of jobs you can
    hit the limit before any cleanup happens.
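
    For what it's worth, a few hedged checks (GNU coreutils assumed; paths
    follow the layout described above) that would confirm this on the
    affected TaskTracker:

        ls -1 logs/userlogsOLD | wc -l   # entries at the first level of the old log dir
        stat -c '%h' logs/userlogsOLD    # hard links = subdirectory count + 2 ("." and the parent entry)
        df -T logs/userlogsOLD           # filesystem type of the mount holding the logs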

    You can experiment with cleanup and other filesystems. The following
    log-related issue might be relevant:

    https://issues.apache.org/jira/browse/MAPREDUCE-323?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12877614#action_12877614

    Regards,
    Edward
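
    On the cleanup question above: in 0.20 the TaskTracker retains task logs
    for mapred.userlog.retain.hours (default 24), though heavy job volume can
    outrun the cleanup, as noted above. A sketch of shortening that window
    (the value is illustrative; merge into the <configuration> element of
    conf/mapred-site.xml on each TaskTracker, then restart it):

        <property>
          <name>mapred.userlog.retain.hours</name>
          <value>6</value>  <!-- keep task logs for 6 hours instead of the default 24 -->
        </property>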
  • Russell Brown at Jun 14, 2010 at 5:59 pm
    I'm a new user of Hadoop. I have a Linux cluster with both gigabit
    ethernet and InfiniBand communications interfaces. Could someone please
    tell me how to switch IP communication from ethernet (the default) to
    InfiniBand? Thanks.

    --

    ------------------------------------------------------------
    Russell A. Brown | Oracle
    russ.brown@oracle.com | UMPK14-260
    (650) 786-3011 (office) | 14 Network Circle
    (650) 786-3453 (fax) | Menlo Park, CA 94025
    ------------------------------------------------------------
  • Allen Wittenauer at Jun 14, 2010 at 7:33 pm

    Hadoop binds inbound connections via the interface settings in the various Hadoop configuration files. Outbound connections are unbound and based solely on the OS configuration. I filed a JIRA to fix this, but it is obviously low priority since few people run multi-NIC boxes. Best bet is to down the ethernet and up the IB, changing routing, etc., as necessary.
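
    For reference, a minimal sketch of those inbound-side settings (the
    interface name ib0 is an assumption, not taken from this thread; per the
    above, this pins only the inbound side, and outbound traffic still
    follows the OS routing table):

        <!-- hdfs-site.xml: bind the DataNode to the IPoIB interface -->
        <property>
          <name>dfs.datanode.dns.interface</name>
          <value>ib0</value>
        </property>

        <!-- mapred-site.xml: same for the TaskTracker -->
        <property>
          <name>mapred.tasktracker.dns.interface</name>
          <value>ib0</value>
        </property>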
  • Russell Brown at Jun 15, 2010 at 2:42 pm
    Thanks, Allen, for responding.

    So, if I understand you correctly, the dfs.datanode.dns.interface and
    mapred.tasktracker.dns.interface options may be used to define inbound
    connections only?

    Concerning the OS configuration, my /etc/hosts files assign unique host
    names to the ethernet and IB interfaces. However, even if I specify the
    IB host names in the masters and slaves files, communication still
    occurs via ethernet, not via IB.

    Your recommendation would therefore be to define IB instead of ethernet
    as the default network interface connection, right?

    Thanks,

    Russ

  • Allen Wittenauer at Jun 15, 2010 at 6:10 pm

    On Jun 15, 2010, at 7:40 AM, Russell Brown wrote:

    Thanks, Allen, for responding.

    So, if I understand you correctly, the dfs.datanode.dns.interface and mapred.tasktracker.dns.interface options may be used to define inbound connections only?
    Correct. The daemons will bind to those interfaces and use those names as their 'official' inbound connection.
    Concerning the OS configuration, my /etc/hosts files assign unique host names to the ethernet and IB interfaces. However, even if I specify the IB host names in the masters and slaves files, communication still occurs via ethernet, not via IB.
    BTW, are you doing this on Solaris or Linux?

    Solaris is notorious for not honoring inbound and outbound interfaces. [In other words, just because the packet came in on bge0, that is no guarantee that the reply will go out on bge0 if another route is available. Particularly frustrating with NFS and SunCluster.]
    Your recommendation would therefore be to define IB instead of ethernet as the default network interface connection, right?
    Yup. Or at least give it a lower cost in the routing table.
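
    As a hedged sketch of the lower-cost option on Linux (interface names,
    subnet, and addresses are assumptions, not taken from this thread):

        # eth0 = gigabit ethernet, ib0 = IPoIB, 192.168.10.0/24 = assumed cluster subnet
        ip route del 192.168.10.0/24 dev eth0 2>/dev/null   # drop the ethernet route if present
        ip route add 192.168.10.0/24 dev ib0 metric 0       # lowest-cost route now runs over IB
        # or, more bluntly, down the ethernet interface as suggested earlier:
        # ifdown eth0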
  • Russell Brown at Jun 15, 2010 at 6:22 pm
    FYI, Allen Wittenauer,

    I'm using Linux not Solaris, but I'll pay attention to your comment
    about Solaris if I install Solaris on the cluster. Thanks again for
    your helpful comments.

    Russ
  • Johannes Zillmann at Jun 16, 2010 at 7:17 am
    Hi Edward,

    I copied the userlogs folder which caused the error.
    Two things speak against the too-many-files theory:
    a) I can add new files to this folder (touch userlogsOLD/a, etc.)
    b) sysctl fs.file-max shows 817874, whereas the entry count at the first level of userlogsOLD is 31999 and the recursive file count is 107400.

    Any thoughts?
    Johannes

  • Amareshwari Sri Ramadasu at Jun 16, 2010 at 7:23 am
    The issue is fixed in branch 0.21 through http://issues.apache.org/jira/browse/MAPREDUCE-927.
    There, the attempt directories are moved inside the job directory, so the userlogs directory holds only job directories.

    Thanks
    Amareshwari
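
    For illustration, the layout change looks roughly like this (job and
    attempt IDs reused from the log excerpt above; exact naming in 0.21 may
    differ):

        # before (0.20.x): every attempt sits directly under userlogs, so one
        # busy TaskTracker can accumulate tens of thousands of entries
        logs/userlogs/attempt_201006091425_0049_m_003179_0/

        # after MAPREDUCE-927 (0.21): attempts are grouped per job
        logs/userlogs/job_201006091425_0049/attempt_201006091425_0049_m_003179_0/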
  • Manhee Jo at Jun 16, 2010 at 7:31 am
    Hi,

    I've also encountered the same 'nonzero status of 1' error before.
    What did you set mapred.child.ulimit and mapred.child.java.opts to?
    mapred.child.ulimit must be greater than the -Xmx value passed to the
    JVM, or the VM might not start. That's what the MapReduce tutorial says.
    Setting a bigger ulimit solved the problem for me.
    Hope this helps.


    Regards,
    Manhee
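
    To make that concrete, a sketch of the two properties (values are
    illustrative; mapred.child.ulimit is virtual memory in kilobytes and must
    leave headroom above the heap, or the child JVM can exit with status 1
    before writing any logs):

        <property>
          <name>mapred.child.java.opts</name>
          <value>-Xmx512m</value>
        </property>
        <property>
          <name>mapred.child.ulimit</name>
          <value>1572864</value>  <!-- 1.5 GB in KB; headroom above the 512 MB heap -->
        </property>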

  • Johannes Zillmann at Jun 17, 2010 at 12:07 pm
    Seems like this is indeed some kind of per-folder restriction.
    Tried:
    cp -r logs/userlogsOLD/* logs/userlogs/
    and got
    cp: cannot create directory `logs/userlogs/attempt_201006091425_0049_m_003169_0': Too many links

    Johannes
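
    "Too many links" is EMLINK, which matches ext3's 32,000 hard-link cap per
    directory: a directory starts with 2 links ("." plus its entry in the
    parent), and each subdirectory's ".." adds one, so at most 31,998
    subdirectories fit. A hedged repro on a scratch ext3 mount (the path is
    an assumption):

        mkdir /mnt/scratch/linktest && cd /mnt/scratch/linktest
        for i in $(seq 1 32000); do
          mkdir "d$i" || { echo "mkdir failed at directory $i"; break; }  # expect failure at 31999
        done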