If a reducer failed at shuffling stage, the task should fail, not just logging an exception
-------------------------------------------------------------------------------------------

Key: HADOOP-4163
URL: https://issues.apache.org/jira/browse/HADOOP-4163
Project: Hadoop Core
Issue Type: Bug
Components: mapred
Affects Versions: 0.17.1
Reporter: Runping Qi




I saw a reducer stuck at the shuffling stage, with the following exception logged in the log file:

2008-08-30 00:16:23,265 ERROR org.apache.hadoop.mapred.ReduceTask: Map output copy failure: org.apache.hadoop.fs.FSError: java.io.IOException: No space left on device
at org.apache.hadoop.fs.RawLocalFileSystem$LocalFSFileOutputStream.write(RawLocalFileSystem.java:199)
at java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:65)
at java.io.BufferedOutputStream.flush(BufferedOutputStream.java:123)
at java.io.FilterOutputStream.close(FilterOutputStream.java:140)
at org.apache.hadoop.fs.FSDataOutputStream$PositionCache.close(FSDataOutputStream.java:59)
at org.apache.hadoop.fs.FSDataOutputStream.close(FSDataOutputStream.java:79)
at org.apache.hadoop.fs.ChecksumFileSystem$ChecksumFSOutputSummer.close(ChecksumFileSystem.java:332)
at org.apache.hadoop.fs.FSDataOutputStream$PositionCache.close(FSDataOutputStream.java:59)
at org.apache.hadoop.fs.FSDataOutputStream.close(FSDataOutputStream.java:79)
at org.apache.hadoop.mapred.MapOutputLocation.getFile(MapOutputLocation.java:185)
at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.copyOutput(ReduceTask.java:815)
at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.run(ReduceTask.java:764)
Caused by: java.io.IOException: No space left on device
at java.io.FileOutputStream.writeBytes(Native Method)
at java.io.FileOutputStream.write(FileOutputStream.java:260)
at org.apache.hadoop.fs.RawLocalFileSystem$LocalFSFileOutputStream.write(RawLocalFileSystem.java:197)
... 11 more

2008-08-30 00:16:23,320 WARN org.apache.hadoop.mapred.TaskTracker: Error running child
java.io.IOException: task_200808291851_0001_r_000023_0The reduce copier failed
at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:329)
at org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:2122)

The task should have died.
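The following is a minimal, self-contained Java sketch (not Hadoop's code) of one plausible way a copier failure that is only logged can leave a reducer waiting in the shuffle. `DiskFullError` is a hypothetical stand-in for `org.apache.hadoop.fs.FSError`, which extends `java.lang.Error`. The single copier thread and the latch are simplifications and do not reflect how `ReduceTask` is actually structured.

```java
import java.io.IOException;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;
import java.util.logging.Logger;

public class LogOnlyCopierDemo {
    private static final Logger LOG = Logger.getLogger("LogOnlyCopierDemo");

    /** Hypothetical stand-in for org.apache.hadoop.fs.FSError (an Error subclass). */
    static class DiskFullError extends Error {
        DiskFullError(Throwable cause) { super(cause); }
    }

    /**
     * Simulates a reducer waiting for numMaps map outputs while the copier fails
     * on the last one. Returns true if the shuffle finished within timeoutMs,
     * false if it stalled.
     */
    static boolean shuffleCompletes(int numMaps, long timeoutMs) throws InterruptedException {
        CountDownLatch copied = new CountDownLatch(numMaps);
        Thread copier = new Thread(() -> {
            for (int i = 0; i < numMaps; i++) {
                try {
                    if (i == numMaps - 1) {
                        // The last fetch hits a full local disk.
                        throw new DiskFullError(new IOException("No space left on device"));
                    }
                    copied.countDown(); // output i fetched successfully
                } catch (Throwable t) {
                    // Anti-pattern: the failure is only logged; the task is never told.
                    LOG.severe("Map output copy failure: " + t);
                }
            }
        });
        copier.start();
        copier.join();
        // The shuffle loop is still waiting for the output that was never copied.
        return copied.await(timeoutMs, TimeUnit.MILLISECONDS);
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println("shuffle completed: " + shuffleCompletes(3, 200));
    }
}
```

Because the error is swallowed inside the copier, the waiting side only ever sees a missing output, never a failure it could act on.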



--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


  • Devaraj Das (JIRA) at Sep 22, 2008 at 4:36 am
    [ https://issues.apache.org/jira/browse/HADOOP-4163?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

    Devaraj Das updated HADOOP-4163:
    --------------------------------

    Priority: Blocker (was: Major)
    Fix Version/s: 0.19.0
    Assignee: Amareshwari Sriramadasu

    Marking this as a blocker until we get to the root cause.
  • Amareshwari Sriramadasu (JIRA) at Sep 22, 2008 at 10:59 am
    [ https://issues.apache.org/jira/browse/HADOOP-4163?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12633207#action_12633207 ]

    Amareshwari Sriramadasu commented on HADOOP-4163:
    -------------------------------------------------

    Runping, can you give some information about the job?
    1. What is the typical map runtime in your job? Is it less than 4 seconds?
    2. What is the value of *mapred.reduce.copy.backoff* in your configuration?
    3. Were any maps re-executed because of "Too many fetch failures"?
    4. The log you attached shows *Error running child* from the TaskTracker. So did the reducer eventually die? If so, are you saying that it spent a long time in the shuffle before it died?
    5. What happened to the job finally? Did it succeed?

  • Amareshwari Sriramadasu (JIRA) at Sep 22, 2008 at 11:11 am
    [ https://issues.apache.org/jira/browse/HADOOP-4163?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12633214#action_12633214 ]

    Amareshwari Sriramadasu commented on HADOOP-4163:
    -------------------------------------------------

    Or is it the same as HADOOP-4115?
  • Amareshwari Sriramadasu (JIRA) at Sep 25, 2008 at 5:54 am
    [ https://issues.apache.org/jira/browse/HADOOP-4163?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12634379#action_12634379 ]

    Amareshwari Sriramadasu commented on HADOOP-4163:
    -------------------------------------------------

    Runping, sorry for asking this late. Is it possible to get the JobTracker and TaskTracker logs for the reduce task?
  • Amareshwari Sriramadasu (JIRA) at Sep 25, 2008 at 6:04 am
    [ https://issues.apache.org/jira/browse/HADOOP-4163?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12634379#action_12634379 ]

    amareshwari edited comment on HADOOP-4163 at 9/24/08 11:03 PM:
    ---------------------------------------------------------------------------

    Runping, sorry for asking this late. Is it possible to get the JobTracker and TaskTracker logs for the reduce task?

    was (Author: amareshwari):
    Runping, Sorry for asking this late. Is it possible to get JobTracker and TaskTracker for the reduce task?
  • Runping Qi (JIRA) at Sep 25, 2008 at 4:20 pm
    [ https://issues.apache.org/jira/browse/HADOOP-4163?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12634531#action_12634531 ]

    Runping Qi commented on HADOOP-4163:
    ------------------------------------


    They are all gone.

    I'll keep them next time.

  • Devaraj Das (JIRA) at Sep 26, 2008 at 8:04 am
    [ https://issues.apache.org/jira/browse/HADOOP-4163?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

    Devaraj Das reassigned HADOOP-4163:
    -----------------------------------

    Assignee: Sharad Agarwal (was: Amareshwari Sriramadasu)
  • Sharad Agarwal (JIRA) at Sep 29, 2008 at 6:19 am
    [ https://issues.apache.org/jira/browse/HADOOP-4163?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

    Sharad Agarwal updated HADOOP-4163:
    -----------------------------------

    Attachment: 4163_v1.patch

    Patch with a simple fix. It adds a check for FSError: if an FSError is encountered, the task notifies the TaskTracker (TT) about it so that the TT can purge this task.
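The fail-fast idea in the patch description can be sketched in plain Java as follows. This is an illustration under stated assumptions, not the code in 4163_v1.patch: `DiskFullError` is a hypothetical stand-in for `org.apache.hadoop.fs.FSError`, and recording the error in an `AtomicReference` that the shuffle loop polls stands in for notifying the TaskTracker. The point it shows is that an `Error` raised in a copier thread has to be handed to whoever decides the task's fate, rather than only being logged.

```java
import java.io.IOException;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicReference;

public class FailFastCopierDemo {

    /** Hypothetical stand-in for org.apache.hadoop.fs.FSError (an Error subclass). */
    static class DiskFullError extends Error {
        DiskFullError(Throwable cause) { super(cause); }
    }

    /**
     * Runs a simulated shuffle of numMaps outputs. If diskFills is true, the last
     * copy fails; the copier records the error and the shuffle loop fails the task
     * by throwing IOException. Returns normally only if every output was copied.
     */
    static void shuffle(int numMaps, boolean diskFills) throws IOException, InterruptedException {
        CountDownLatch copied = new CountDownLatch(numMaps);
        AtomicReference<Error> fatal = new AtomicReference<>();

        Thread copier = new Thread(() -> {
            for (int i = 0; i < numMaps; i++) {
                try {
                    if (diskFills && i == numMaps - 1) {
                        throw new DiskFullError(new IOException("No space left on device"));
                    }
                    copied.countDown(); // output i fetched successfully
                } catch (DiskFullError e) {
                    // Hand the error to the task instead of swallowing it
                    // (in the patch, this is where the TaskTracker is notified).
                    fatal.compareAndSet(null, e);
                    return; // stop fetching onto a full disk
                }
            }
        });
        copier.start();

        // Shuffle loop: wait for outputs, re-checking for a fatal error on every tick.
        while (!copied.await(50, TimeUnit.MILLISECONDS)) {
            Error err = fatal.get();
            if (err != null) {
                throw new IOException("Shuffle failed: " + err.getCause().getMessage(), err);
            }
        }
        copier.join();
    }

    public static void main(String[] args) throws InterruptedException {
        try {
            shuffle(3, true);
            System.out.println("shuffle completed");
        } catch (IOException e) {
            System.out.println("task failed: " + e.getMessage());
        }
    }
}
```

Running `main` prints `task failed: Shuffle failed: No space left on device`, so the task dies promptly instead of waiting for an output that will never arrive.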
  • Sharad Agarwal (JIRA) at Sep 29, 2008 at 8:55 am
    [ https://issues.apache.org/jira/browse/HADOOP-4163?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

    Sharad Agarwal updated HADOOP-4163:
    -----------------------------------

    Status: Patch Available (was: Open)
    If a reducer failed at shuffling stage, the task should fail, not just logging an exception
    -------------------------------------------------------------------------------------------

    Key: HADOOP-4163
    URL: https://issues.apache.org/jira/browse/HADOOP-4163
    Project: Hadoop Core
    Issue Type: Bug
    Components: mapred
    Affects Versions: 0.17.1
    Reporter: Runping Qi
    Assignee: Sharad Agarwal
    Priority: Blocker
    Fix For: 0.19.0

    Attachments: 4163_v1.patch


  • Hadoop QA (JIRA) at Sep 29, 2008 at 12:13 pm
    [ https://issues.apache.org/jira/browse/HADOOP-4163?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12635364#action_12635364 ]

    Hadoop QA commented on HADOOP-4163:
    -----------------------------------

    -1 overall. Here are the results of testing the latest attachment
    http://issues.apache.org/jira/secure/attachment/12391114/4163_v1.patch
    against trunk revision 700028.

    +1 @author. The patch does not contain any @author tags.

    -1 tests included. The patch doesn't appear to include any new or modified tests.
    Please justify why no tests are needed for this patch.

    +1 javadoc. The javadoc tool did not generate any warning messages.

    +1 javac. The applied patch does not increase the total number of javac compiler warnings.

    -1 findbugs. The patch appears to introduce 1 new Findbugs warnings.

    +1 Eclipse classpath. The patch retains Eclipse classpath integrity.

    -1 core tests. The patch failed core unit tests.

    +1 contrib tests. The patch passed contrib unit tests.

    Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3391/testReport/
    Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3391/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
    Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3391/artifact/trunk/build/test/checkstyle-errors.html
    Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3391/console

    This message is automatically generated.
  • Chris Douglas (JIRA) at Sep 30, 2008 at 1:22 am
    [ https://issues.apache.org/jira/browse/HADOOP-4163?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

    Chris Douglas updated HADOOP-4163:
    ----------------------------------

    Status: Open (was: Patch Available)

    * handleIfFSError(t) doesn't need to be called in contexts where mergeThrowable is set. Equivalent code should be called after ReduceCopier::fetchOutputs returns false.
    * Code handling FSError should be in a catch block, not handled using instanceof in a method call from a catch of Throwable. The retry loop is unnecessary. The call to System.exit is overly aggressive. (i.e. handleIfFSError should not exist)
    * Discarding map output cannot generate FSError and does not require handling.

    This should be replaced with a catch of FSError before Throwable in MapOutputCopier::run that calls umbilical.fsError (if it throws, the exception can be logged and ignored). If reduceCopier.fetchOutputs returns false, then reduceCopier.mergeThrowable should be the cause of the thrown exception (it's OK if it's null). If mergeThrowable is FSError, it would be reasonable to call umbilical.fsError before the throw.
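The review suggestion above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the committed patch: `FSError` is a local stand-in for `org.apache.hadoop.fs.FSError`, and `Umbilical`, `Copier`, `reportFsError` and `checkFetch` are hypothetical, simplified names (the real umbilical's `fsError` signature may differ). It shows FSError caught before Throwable in the copier, a failed report being logged and ignored, and the task failing with `mergeThrowable` as the cause when `fetchOutputs` returns false.

```java
import java.io.IOException;

/**
 * Sketch of the review suggestion. FSError, Umbilical, Copier and checkFetch
 * are simplified, hypothetical stand-ins, not Hadoop's real classes.
 */
class CopierErrorSketch {

  /** Stand-in for org.apache.hadoop.fs.FSError (an Error wrapping a local IOException). */
  static class FSError extends Error {
    FSError(Throwable cause) { super(cause); }
  }

  /** Hypothetical, simplified stand-in for the child-to-TaskTracker umbilical. */
  interface Umbilical {
    void fsError(String taskId, String message) throws IOException;
  }

  /** Reports an FSError; if reporting itself fails, that failure is logged and ignored. */
  static void reportFsError(Umbilical umbilical, String taskId, Throwable t) {
    try {
      umbilical.fsError(taskId, t.getMessage());
    } catch (IOException io) {
      System.err.println("Failed to report FSError: " + io);
    }
  }

  /** Simplified MapOutputCopier: copyOutput performs one copy and may throw FSError. */
  static class Copier implements Runnable {
    private final String taskId;
    private final Umbilical umbilical;
    private final Runnable copyOutput;

    Copier(String taskId, Umbilical umbilical, Runnable copyOutput) {
      this.taskId = taskId;
      this.umbilical = umbilical;
      this.copyOutput = copyOutput;
    }

    @Override
    public void run() {
      try {
        copyOutput.run();
      } catch (FSError e) {
        // Caught before Throwable: a local disk failure is reported so the task fails.
        reportFsError(umbilical, taskId, e);
      } catch (Throwable t) {
        // Everything else keeps the existing log-and-continue behaviour.
        System.err.println("Map output copy failure: " + t);
      }
    }
  }

  /** After fetchOutputs returns: fail the task with mergeThrowable as the cause. */
  static void checkFetch(boolean fetched, Throwable mergeThrowable,
                         String taskId, Umbilical umbilical) throws IOException {
    if (fetched) {
      return;
    }
    if (mergeThrowable instanceof FSError) { // false when mergeThrowable is null
      reportFsError(umbilical, taskId, mergeThrowable);
    }
    // A null cause is allowed here.
    throw new IOException(taskId + ": The reduce copier failed", mergeThrowable);
  }
}
```

Catching `FSError` in its own clause, ahead of `Throwable`, keeps the disk-failure path separate from the generic logging path without any `instanceof` dispatch.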
  • Sharad Agarwal (JIRA) at Sep 30, 2008 at 7:36 am
    [ https://issues.apache.org/jira/browse/HADOOP-4163?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

    Sharad Agarwal updated HADOOP-4163:
    -----------------------------------

    Attachment: 4163_v2.patch

    incorporated Chris' comments
  • Devaraj Das (JIRA) at Sep 30, 2008 at 12:53 pm
    [ https://issues.apache.org/jira/browse/HADOOP-4163?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12635736#action_12635736 ]

    Devaraj Das commented on HADOOP-4163:
    -------------------------------------

    Today, if the copier thread (ReduceTask.ReduceCopier.MapOutputCopier.run()) throws a Throwable, it is logged and ignored. I am wondering whether it makes sense to treat all exceptions except IOExceptions (mostly due to network issues) as fatal. Here is one thought:
    Rename mergeThrowable to shuffleThrowable. In the copier thread, we could set shuffleThrowable when Throwable is caught (IOException is caught separately already). In all the places where mergeThrowable is set, we could set shuffleThrowable. The loop inside fetchOutputs could check whether shuffleThrowable is non-null.
    When fetchOutputs returns with a 'false', we could check whether the shuffleThrowable is an instance of Error and if so, throw the Error out. In the other case, we could wrap it in an IOException. Doing it in the above way would mean that we call umbilical.fsError at exactly one place - in Child.main().
    But I am slightly apprehensive about the implications of this change this late in the game. Thoughts?
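The shuffleThrowable proposal above can be sketched as follows. It is a hypothetical, single-threaded simplification: `CopyAttempt`, `copyOne` and `shuffle` are invented names, and the real copiers run on separate threads (hence the `volatile` field and the synchronized first-writer check). It only illustrates the idea; the follow-up comments agree to defer this restructuring past 0.19.

```java
import java.io.IOException;
import java.util.List;

/**
 * Sketch of the shuffleThrowable proposal. CopyAttempt, copyOne and shuffle are
 * hypothetical simplifications; real copiers run on their own threads.
 */
class ShuffleThrowableSketch {

  /** One map-output copy; a hypothetical stand-in for copyOutput. */
  interface CopyAttempt {
    void run() throws IOException;
  }

  /** First non-IOException failure seen by any copier; null while the shuffle is healthy. */
  private volatile Throwable shuffleThrowable;

  /** Copier-thread body for one attempt. */
  void copyOne(CopyAttempt attempt) {
    try {
      attempt.run();
    } catch (IOException ioe) {
      // Not fatal under the proposal (mostly network issues); this sketch only logs it.
      System.err.println("Copy failed: " + ioe);
    } catch (Throwable t) {
      // Anything else is treated as fatal; keep only the first one.
      synchronized (this) {
        if (shuffleThrowable == null) {
          shuffleThrowable = t;
        }
      }
    }
  }

  /** Simplified fetchOutputs: the loop stops as soon as shuffleThrowable is set. */
  boolean fetchOutputs(List<CopyAttempt> attempts) {
    for (CopyAttempt attempt : attempts) {
      if (shuffleThrowable != null) {
        return false;
      }
      copyOne(attempt);
    }
    return shuffleThrowable == null;
  }

  /** After a 'false' return: rethrow an Error as-is, wrap anything else in an IOException. */
  void shuffle(List<CopyAttempt> attempts) throws IOException {
    if (!fetchOutputs(attempts)) {
      Throwable t = shuffleThrowable;
      if (t instanceof Error) {
        throw (Error) t;
      }
      throw new IOException("The reduce copier failed", t);
    }
  }
}
```

Rethrowing an `Error` unchanged is what would let a single handler further up (the comment suggests Child.main()) be the only place that calls umbilical.fsError.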

  • Arun C Murthy (JIRA) at Sep 30, 2008 at 6:06 pm
    [ https://issues.apache.org/jira/browse/HADOOP-4163?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12635798#action_12635798 ]

    Arun C Murthy commented on HADOOP-4163:
    ---------------------------------------

    bq. But I am slightly apprehensive about the implications of this change this late in the game. Thoughts?

    I agree, we probably should fix that in 0.20.0.
  • Chris Douglas (JIRA) at Sep 30, 2008 at 6:44 pm
    [ https://issues.apache.org/jira/browse/HADOOP-4163?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12635807#action_12635807 ]

    Chris Douglas commented on HADOOP-4163:
    ---------------------------------------

    * Calling fsError with the message from the exception would probably be more useful
    * Instead of rethrowing FSError, setting reduceCopier.mergeThrowable as the cause of the IOE thrown is both more polite and also useful when the merge fails
    * The check for null before instanceof is [redundant|http://java.sun.com/docs/books/jls/third_edition/html/expressions.html#15.20.2]
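The three review points can be illustrated with a small self-contained sketch. `FSError` here is a local stand-in for `org.apache.hadoop.fs.FSError`, and `isFsError`, `isFsErrorVerbose` and `copierFailure` are hypothetical helpers. Per JLS §15.20.2, `instanceof` evaluates to false when its operand is null, and an `Error` constructed from a cause takes `cause.toString()` as its message, so passing `getMessage()` to fsError carries the underlying "No space left on device" text.

```java
import java.io.IOException;

/** Illustrates the review points; FSError is a local stand-in, the helpers are hypothetical. */
class ReviewPointsSketch {

  /** Stand-in for org.apache.hadoop.fs.FSError. */
  static class FSError extends Error {
    FSError(Throwable cause) { super(cause); }
  }

  /** Redundant form: the null check adds nothing. */
  static boolean isFsErrorVerbose(Throwable t) {
    return t != null && t instanceof FSError;
  }

  /** Equivalent, simpler form: instanceof is already false for null. */
  static boolean isFsError(Throwable t) {
    return t instanceof FSError;
  }

  /**
   * Instead of rethrowing mergeThrowable, attach it as the cause of the
   * IOException that fails the task; a null cause is allowed.
   */
  static IOException copierFailure(String taskId, Throwable mergeThrowable) {
    return new IOException(taskId + ": The reduce copier failed", mergeThrowable);
  }
}
```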

    bq. I am wondering whether it makes sense to treat all exceptions except IOExceptions (mostly due to network issues) as fatal [...]
    If we ignore IOException, that leaves unchecked exceptions and Errors. Other than FSError, what do we expect, and why would we expect other errors from fetch threads to kill the task profitably? I think it will improve the structure of the code, but it seems risky for 0.19 unless we observe other errors that should kill the task and don't.
  • Devaraj Das (JIRA) at Sep 30, 2008 at 7:10 pm
    [ https://issues.apache.org/jira/browse/HADOOP-4163?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12635812#action_12635812 ]

    Devaraj Das commented on HADOOP-4163:
    -------------------------------------

    Agree with you, Chris and Arun. Let's do the code improvement later.
  • Sharad Agarwal (JIRA) at Oct 6, 2008 at 8:32 am
    [ https://issues.apache.org/jira/browse/HADOOP-4163?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

    Sharad Agarwal updated HADOOP-4163:
    -----------------------------------

    Attachment: 4163_v3.patch

    incorporated Chris' feedback
  • Sharad Agarwal (JIRA) at Oct 6, 2008 at 8:36 am
    [ https://issues.apache.org/jira/browse/HADOOP-4163?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

    Sharad Agarwal updated HADOOP-4163:
    -----------------------------------

    Status: Patch Available (was: Open)
  • Hadoop QA (JIRA) at Oct 6, 2008 at 11:06 am
    [ https://issues.apache.org/jira/browse/HADOOP-4163?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12637072#action_12637072 ]

    Hadoop QA commented on HADOOP-4163:
    -----------------------------------

    -1 overall. Here are the results of testing the latest attachment
    http://issues.apache.org/jira/secure/attachment/12391525/4163_v3.patch
    against trunk revision 701948.

    +1 @author. The patch does not contain any @author tags.

    -1 tests included. The patch doesn't appear to include any new or modified tests.
    Please justify why no tests are needed for this patch.

    +1 javadoc. The javadoc tool did not generate any warning messages.

    +1 javac. The applied patch does not increase the total number of javac compiler warnings.

    +1 findbugs. The patch does not introduce any new Findbugs warnings.

    +1 Eclipse classpath. The patch retains Eclipse classpath integrity.

    -1 core tests. The patch failed core unit tests.

    +1 contrib tests. The patch passed contrib unit tests.

    Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3435/testReport/
    Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3435/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
    Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3435/artifact/trunk/build/test/checkstyle-errors.html
    Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3435/console

    This message is automatically generated.
  • Sharad Agarwal (JIRA) at Oct 6, 2008 at 11:12 am
    [ https://issues.apache.org/jira/browse/HADOOP-4163?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12637076#action_12637076 ]

    Sharad Agarwal commented on HADOOP-4163:
    ----------------------------------------

    test failure is unrelated.
  • Chris Douglas (JIRA) at Oct 6, 2008 at 10:19 pm
    [ https://issues.apache.org/jira/browse/HADOOP-4163?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12637277#action_12637277 ]

    Chris Douglas commented on HADOOP-4163:
    ---------------------------------------

    +1
  • Chris Douglas (JIRA) at Oct 6, 2008 at 10:27 pm
    [ https://issues.apache.org/jira/browse/HADOOP-4163?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

    Chris Douglas updated HADOOP-4163:
    ----------------------------------

    Resolution: Fixed
    Hadoop Flags: [Reviewed]
    Status: Resolved (was: Patch Available)

    I just committed this. Thanks, Sharad
  • Hudson (JIRA) at Oct 7, 2008 at 1:32 pm
    [ https://issues.apache.org/jira/browse/HADOOP-4163?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12637478#action_12637478 ]

    Hudson commented on HADOOP-4163:
    --------------------------------

    Integrated in Hadoop-trunk #626 (See [http://hudson.zones.apache.org/hudson/job/Hadoop-trunk/626/])
    . Report FSErrors from map output fetch threads instead of
    merely logging them. Contributed by Sharad Agarwal.
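    The commit message summarizes the fix in one line. The Java sketch below illustrates the general pattern under stated assumptions; it is not the code from 4163_v3.patch. It assumes a simplified shuffle in which each copier thread fetches one map output and the reduce task waits until every output has arrived. `ShuffleErrorPropagation`, `SimulatedFSError`, `startCopier` and `shuffle` are hypothetical names. `SimulatedFSError` stands in for Hadoop's `org.apache.hadoop.fs.FSError`, which likewise extends `Error` rather than `Exception`. A copier that hits the error records it in a shared `AtomicReference`, and the waiting loop checks that reference on every iteration, so the task fails instead of waiting forever.

```java
import java.io.IOException;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicReference;

// Hypothetical sketch, not Hadoop code: copier threads report fatal errors
// to the waiting reduce task instead of only logging them.
public class ShuffleErrorPropagation {

    // Stand-in for org.apache.hadoop.fs.FSError; both extend Error, not Exception.
    static class SimulatedFSError extends Error {
        SimulatedFSError(String message) {
            super(message);
        }
    }

    // Map outputs that copier threads have finished fetching.
    private final BlockingQueue<Integer> copied = new LinkedBlockingQueue<>();
    // First fatal error seen by any copier; null while the shuffle is healthy.
    private final AtomicReference<Throwable> fatalError = new AtomicReference<>();

    private void startCopier(int mapId, boolean diskFull) {
        Thread copier = new Thread(() -> {
            try {
                if (diskFull) {
                    throw new SimulatedFSError("No space left on device");
                }
                copied.add(mapId); // pretend the map output was written to local disk
            } catch (SimulatedFSError e) {
                // Logging alone here would leave shuffle() waiting for an output
                // that never arrives; record the error so the task can fail.
                fatalError.compareAndSet(null, e);
            }
        }, "copier-" + mapId);
        copier.setDaemon(true);
        copier.start();
    }

    // Fetches numMaps outputs; failingMapId simulates a full disk (-1 for none).
    public int shuffle(int numMaps, int failingMapId)
            throws IOException, InterruptedException {
        for (int i = 0; i < numMaps; i++) {
            startCopier(i, i == failingMapId);
        }
        int received = 0;
        while (received < numMaps) {
            Throwable err = fatalError.get();
            if (err != null) {
                // Surface the copier's error so the task dies.
                throw new IOException("Map output copy failed", err);
            }
            // Poll with a timeout so a recorded error is noticed promptly.
            if (copied.poll(100, TimeUnit.MILLISECONDS) != null) {
                received++;
            }
        }
        return received;
    }

    public static void main(String[] args) throws Exception {
        System.out.println(new ShuffleErrorPropagation().shuffle(4, -1)); // prints 4
        try {
            new ShuffleErrorPropagation().shuffle(4, 2);
        } catch (IOException e) {
            // prints "task failed: No space left on device"
            System.out.println("task failed: " + e.getCause().getMessage());
        }
    }
}
```

    If the `fatalError` check is removed, `shuffle(4, 2)` never returns: map output 2 never reaches the queue, which mirrors the stuck reducer in the original report. Polling with a timeout, rather than blocking on `take()`, is what lets the loop notice a recorded error.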

Discussion Overview
group: common-dev @
categories: hadoop
posted: Sep 11, '08 at 10:04p
active: Oct 7, '08 at 1:32p
posts: 24
users: 1
website: hadoop.apache.org...
irc: #hadoop

Hudson (JIRA): 24 posts