I am seeing on one of my long-running jobs (about 50-60 hours) that after 24
hours all active reduce tasks fail with the error message:

java.io.IOException: Task process exit with nonzero status of 255.
at org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:418)

Is there something in the config that I can change to stop this?

Every time, within 1 minute of the 24-hour mark, they all fail at the same time.
That wastes a lot of resources downloading the map outputs and merging them again.

Billy


  • Amareshwari Sriramadasu at Mar 26, 2009 at 3:45 am
Set mapred.jobtracker.retirejob.interval and mapred.userlog.retain.hours
to higher values. By default, their values are 24 hours. These might be
the reason for the failure, though I'm not sure.

    Thanks
    Amareshwari

    Billy Pearson wrote: [...]
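    As a rough sketch of how this suggestion would be applied (my reading, not something spelled out in the thread): both properties are cluster-side settings, normally declared in conf/hadoop-site.xml on the JobTracker and TaskTracker nodes. If I recall the units correctly, mapred.jobtracker.retirejob.interval is given in milliseconds while mapred.userlog.retain.hours is given in hours; the 72-hour values below are assumed, matching the figure Billy mentions trying later in the thread.

    <!-- conf/hadoop-site.xml (sketch): raise both limits from the 24-hour defaults to 72 hours -->
    <property>
      <name>mapred.jobtracker.retirejob.interval</name>
      <!-- assumed to be milliseconds: 72 * 60 * 60 * 1000 -->
      <value>259200000</value>
    </property>
    <property>
      <name>mapred.userlog.retain.hours</name>
      <!-- hours to retain task userlogs after job completion -->
      <value>72</value>
    </property>

    The daemons read this file at startup, so the JobTracker and TaskTrackers would presumably need a restart for the change to take effect.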
  • Amar Kamat at Mar 26, 2009 at 4:00 am

    Amareshwari Sriramadasu wrote:
    > Set mapred.jobtracker.retirejob.interval
    This is used to retire completed jobs.
    > and mapred.userlog.retain.hours to higher value.
    This is used to discard user logs.

    Billy Pearson wrote: [...]
    What is the state of the reducers (copy or sort)? Check the
    jobtracker/tasktracker logs to see what state these reducers are in
    and whether a kill signal was issued. Either the jobtracker/tasktracker
    is issuing a kill signal or the reducers are committing suicide. Were there
    any failures on the reducer side while pulling the map output? Also, what
    is the nature of the job? How fast do the maps finish?
    Amar
  • Amar Kamat at Mar 26, 2009 at 4:06 am

    Amar Kamat wrote: [...]
    As Amareshwari pointed out, these 24-hour defaults might be the cause.
    Can you increase these values and try?
    Amar
  • Billy Pearson at Mar 27, 2009 at 5:44 am
    mapred.jobtracker.retirejob.interval is not in the default config.

    Should it not be in the config?

    Billy



    "Amar Kamat" <amarrk@yahoo-inc.com> wrote in
    message news:49CAFF11.8070400@yahoo-inc.com...
    Amar Kamat wrote:
    Amareshwari Sriramadasu wrote:
    Set mapred.jobtracker.retirejob.interval
    This is used to retire completed jobs.
    and mapred.userlog.retain.hours to higher value.
    This is used to discard user logs.
    As Amareshwari pointed out, this might be the cause. Can you increase this
    value and try?
    Amar
    By default, their values are 24 hours. These might be the reason for
    failure, though I'm not sure.

    Thanks
    Amareshwari

    Billy Pearson wrote:
    I am seeing on one of my long running jobs about 50-60 hours that after
    24 hours all
    active reduce task fail with the error messages

    java.io.IOException: Task process exit with nonzero status of 255.
    at org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:418)

    Is there something in the config that I can change to stop this?

    Every time with in 1 min of 24 hours they all fail at the same time.
    waist a lot of resource downloading the map outputs and merging them
    again.
    What is the state of the reducer (copy or sort)? Check
    jobtracker/task-tracker logs to see what is the state of these reducers
    and whether it issued a kill signal. Either jobtracker/tasktracker is
    issuing a kill signal or the reducers are committing suicide. Were there
    any failures on the reducer side while pulling the map output? Also what
    is the nature of the job? How fast the maps finish?
    Amar
    Billy
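    On the question above (my own hedged understanding, not something stated in the thread): a key does not need to appear in hadoop-default.xml for it to take effect. Anything declared in conf/hadoop-site.xml is layered on top of the built-in defaults, so even though the shipped defaults do not list it, a fragment along these lines should be picked up by the JobTracker:

    <!-- conf/hadoop-site.xml: declaring a key that hadoop-default.xml does not list -->
    <property>
      <name>mapred.jobtracker.retirejob.interval</name>
      <!-- assumed to be milliseconds; 259200000 ms = 72 hours -->
      <value>259200000</value>
    </property>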
  • Billy Pearson at Mar 26, 2009 at 7:25 am
    There are many maps finishing in anywhere from 4 to 15 minutes, with less time closer to the
    end of the job, so no timeout there. The state of the reduce tasks is shuffle;
    they're grabbing the map outputs as they finish. The current job took 50:43:37, and each
    of the reduce tasks failed twice in that time, once at 24 hours in and a second time
    at 48 hours in. On the next run in a few days I will test with the settings
    mapred.jobtracker.retirejob.interval and mapred.userlog.retain.hours at 72
    hours and see if that solves the problem. So not a bad guess, though it seems
    odd that both times it was within 5 minutes of the 24-hour mark, on all the tasks at the same time.


    It looks like from the tasktracker logs I get the WARN below:
    org.apache.hadoop.mapred.TaskRunner: attempt_200903212204_0005_r_000001_1 Child Error


    Grepping the tasktracker log for one of the reduce tasks that failed (I do not have
    debug turned on, so all I have are the INFO logs):

    2009-03-25 18:37:45,473 INFO org.apache.hadoop.mapred.TaskTracker:
    attempt_200903212204_0005_r_000001_1 0.3083758% reduce > copy (2360 of 2551
    at 0.87 MB/s) >
    2009-03-25 18:37:48,476 INFO org.apache.hadoop.mapred.TaskTracker:
    attempt_200903212204_0005_r_000001_1 0.3083758% reduce > copy (2360 of 2551
    at 0.87 MB/s) >
    2009-03-25 18:37:49,194 INFO org.apache.hadoop.mapred.TaskTracker:
    org.apache.hadoop.util.DiskChecker$DiskErrorException: Could not find
    taskTracker/jobcache/job_200903212204_0005/attempt_200903212204_0005_r_000001_1/output/file.out
    in any of the configured local directories
    2009-03-25 18:37:49,480 INFO org.apache.hadoop.mapred.TaskTracker:
    org.apache.hadoop.util.DiskChecker$DiskErrorException: Could not find
    taskTracker/jobcache/job_200903212204_0005/attempt_200903212204_0005_r_000001_1/output/file.out
    in any of the configured local directories
    2009-03-25 18:37:51,481 INFO org.apache.hadoop.mapred.TaskTracker:
    attempt_200903212204_0005_r_000001_1 0.3083758% reduce > copy (2360 of 2551
    at 0.87 MB/s) >
    2009-03-25 18:37:54,372 WARN org.apache.hadoop.mapred.TaskRunner:
    attempt_200903212204_0005_r_000001_1 Child Error
    2009-03-25 18:37:54,497 INFO org.apache.hadoop.mapred.TaskTracker:
    org.apache.hadoop.util.DiskChecker$DiskErrorException: Could not find
    taskTracker/jobcache/job_200903212204_0005/attempt_200903212204_0005_r_000001_1/output/file.out
    in any of the configured local directories
    2009-03-25 18:37:57,400 INFO org.apache.hadoop.mapred.TaskRunner:
    attempt_200903212204_0005_r_000001_1 done; removing files.
    2009-03-25 18:42:25,191 INFO org.apache.hadoop.mapred.TaskTracker:
    LaunchTaskAction (registerTask): attempt_200903212204_0005_r_000001_1 task's
    state:FAILED_UNCLEAN
    2009-03-25 18:42:25,192 INFO org.apache.hadoop.mapred.TaskTracker: Trying to
    launch : attempt_200903212204_0005_r_000001_1
    2009-03-25 18:42:25,192 INFO org.apache.hadoop.mapred.TaskTracker: In
    TaskLauncher, current free slots : 1 and trying to launch
    attempt_200903212204_0005_r_000001_1
    2009-03-25 18:42:30,134 INFO org.apache.hadoop.mapred.TaskTracker: JVM with
    ID: jvm_200903212204_0005_r_437314552 given task:
    attempt_200903212204_0005_r_000001_1
    2009-03-25 18:42:30,196 INFO org.apache.hadoop.mapred.TaskTracker:
    org.apache.hadoop.util.DiskChecker$DiskErrorException: Could not find
    taskTracker/jobcache/job_200903212204_0005/attempt_200903212204_0005_r_000001_1/output/file.out
    in any of the configured local directories
    2009-03-25 18:42:32,530 INFO org.apache.hadoop.mapred.TaskTracker:
    attempt_200903212204_0005_r_000001_1 0.0%
    2009-03-25 18:42:32,555 INFO org.apache.hadoop.mapred.TaskTracker:
    attempt_200903212204_0005_r_000001_1 0.0% cleanup
    2009-03-25 18:42:32,567 INFO org.apache.hadoop.mapred.TaskTracker: Task
    attempt_200903212204_0005_r_000001_1 is done.
    2009-03-25 18:42:32,567 INFO org.apache.hadoop.mapred.TaskTracker: reported
    output size for attempt_200903212204_0005_r_000001_1 was 0
    2009-03-25 18:42:32,568 INFO org.apache.hadoop.mapred.TaskRunner:
    attempt_200903212204_0005_r_000001_1 done; removing files.


    Grepping the jobtracker log for the same task:

    2009-03-25 18:37:54,500 INFO org.apache.hadoop.mapred.TaskInProgress: Error
    from attempt_200903212204_0005_r_000001_1: java.io.IOException: Task process
    exit with nonzero status of 255.
    2009-03-25 18:42:25,186 INFO org.apache.hadoop.mapred.JobTracker: Adding
    task (cleanup)'attempt_200903212204_0005_r_000001_1' to tip
    task_200903212204_0005_r_000001, for tracker
    'tracker_server-1:localhost.localdomain/127.0.0.1:38816'
    2009-03-25 18:42:32,589 INFO org.apache.hadoop.mapred.JobTracker: Removed
    completed task 'attempt_200903212204_0005_r_000001_1' from
    'tracker_server-1:localhost.localdomain/127.0.0.1:38816'






    "Amar Kamat" <amarrk@yahoo-inc.com> wrote in
    message news:49CAFD8E.8010700@yahoo-inc.com...
    Amareshwari Sriramadasu wrote:
    Set mapred.jobtracker.retirejob.interval
    This is used to retire completed jobs.
    and mapred.userlog.retain.hours to higher value.
    This is used to discard user logs.
    By default, their values are 24 hours. These might be the reason for
    failure, though I'm not sure.

    Thanks
    Amareshwari

    Billy Pearson wrote:
    I am seeing on one of my long running jobs about 50-60 hours that after
    24 hours all
    active reduce task fail with the error messages

    java.io.IOException: Task process exit with nonzero status of 255.
    at org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:418)

    Is there something in the config that I can change to stop this?

    Every time with in 1 min of 24 hours they all fail at the same time.
    waist a lot of resource downloading the map outputs and merging them
    again.
    What is the state of the reducer (copy or sort)? Check
    jobtracker/task-tracker logs to see what is the state of these reducers
    and whether it issued a kill signal. Either jobtracker/tasktracker is
    issuing a kill signal or the reducers are committing suicide. Were there
    any failures on the reducer side while pulling the map output? Also what
    is the nature of the job? How fast the maps finish?
    Amar
    Billy
