Hi everyone,
(apologies if this gets posted on the list twice for some reason, my first
attempt was denied as "suspected spam")

I ran a job last night with Hadoop 0.18.0 on EC2, using the standard small
AMI. The job was producing gzipped output, otherwise I haven't changed the
configuration.
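For reference, gzipped output in the 0.18-era mapred API is typically enabled with the standard output-compression settings; a minimal sketch (the job name and paths are placeholders):

    // Minimal job setup enabling gzipped reduce output with the old
    // "mapred" API; input/output paths come from the command line.
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.compress.GzipCodec;
    import org.apache.hadoop.mapred.FileInputFormat;
    import org.apache.hadoop.mapred.FileOutputFormat;
    import org.apache.hadoop.mapred.JobClient;
    import org.apache.hadoop.mapred.JobConf;

    public class GzipOutputJob {
      public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(GzipOutputJob.class);
        conf.setJobName("gzip-output-example");

        FileInputFormat.setInputPaths(conf, new Path(args[0]));
        FileOutputFormat.setOutputPath(conf, new Path(args[1]));

        // Compress the final job output with gzip.
        FileOutputFormat.setCompressOutput(conf, true);
        FileOutputFormat.setOutputCompressorClass(conf, GzipCodec.class);

        JobClient.runJob(conf);
      }
    }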

The final reduce steps failed with an error that I haven't seen before:

2008-10-01 05:02:39,810 WARN org.apache.hadoop.mapred.ReduceTask: attempt_200809301822_0005_r_000001_0 Merging of the local FS files threw an exception: java.io.IOException: java.io.IOException: Rec# 289050: Negative value-length: -96
at org.apache.hadoop.mapred.IFile$Reader.next(IFile.java:331)
at org.apache.hadoop.mapred.Merger$Segment.next(Merger.java:134)
at org.apache.hadoop.mapred.Merger$MergeQueue.adjustPriorityQueue(Merger.java:225)
at org.apache.hadoop.mapred.Merger$MergeQueue.next(Merger.java:242)
at org.apache.hadoop.mapred.Merger.writeFile(Merger.java:83)
at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$LocalFSMerger.run(ReduceTask.java:2021)
at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$LocalFSMerger.run(ReduceTask.java:2025)

2008-10-01 05:02:44,131 WARN org.apache.hadoop.mapred.TaskTracker: Error running child
java.io.IOException: attempt_200809301822_0005_r_000001_0The reduce copier failed
at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:255)
at org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:2209)

When I try to download the data from HDFS I get a "Found checksum error"
warning message.
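I can still pull the corrupt files down for inspection by disabling client-side checksum verification; a minimal sketch, assuming FileSystem.setVerifyChecksum behaves here as in later releases (the namenode URI and paths are placeholders):

    // Copy a file out of HDFS with client-side CRC verification turned
    // off, so the read doesn't abort on the corrupt block. The namenode
    // URI and the paths are hypothetical.
    import java.net.URI;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class FetchIgnoringCrc {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:9000/"), conf);

        // Skip checksum verification for subsequent reads.
        fs.setVerifyChecksum(false);

        fs.copyToLocalFile(new Path("/output/part-00001.gz"),
                           new Path("/tmp/part-00001.gz"));
      }
    }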

Any ideas what could be the cause? Would upgrading to 0.18.1 solve it?
Thanks,
/ Per


  • Arun C Murthy at Oct 1, 2008 at 6:24 pm

    On Oct 1, 2008, at 11:07 AM, Per Jacobsson wrote:
    I ran a job last night with Hadoop 0.18.0 on EC2, using the standard small AMI. The job was producing gzipped output, otherwise I haven't changed the configuration.

    The final reduce steps failed with an error that I haven't seen before:

    2008-10-01 05:02:39,810 WARN org.apache.hadoop.mapred.ReduceTask: attempt_200809301822_0005_r_000001_0 Merging of the local FS files threw an exception: java.io.IOException: java.io.IOException: Rec# 289050: Negative value-length: -96
    Do you still have the task logs for the reduce?

    I suspect you are running into http://issues.apache.org/jira/browse/HADOOP-3647, which we could never reproduce reliably enough to pin down and fix.

    However, in light of http://issues.apache.org/jira/browse/HADOOP-4277, we suspect this could be caused by a bug in the LocalFileSystem that could hide data corruption on your local disk, leading to errors of this nature. Could you try running your job with that patch once release 0.18.2 is available?
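    For context, LocalFileSystem is the checksummed layer over RawLocalFileSystem: reads are verified against a .crc sidecar file, so a bug in that layer can let corrupt local bytes through silently. A rough probe sketch of the two read paths; the spill path here is made up:

        // Read the same local file with and without the CRC layer.
        // The path is hypothetical; pick any file the checksummed
        // LocalFileSystem has written (so a .crc sidecar exists).
        import org.apache.hadoop.conf.Configuration;
        import org.apache.hadoop.fs.FSDataInputStream;
        import org.apache.hadoop.fs.FileSystem;
        import org.apache.hadoop.fs.LocalFileSystem;
        import org.apache.hadoop.fs.Path;

        public class LocalCrcProbe {
          public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            LocalFileSystem local = FileSystem.getLocal(conf); // CRC-checked
            FileSystem raw = local.getRawFileSystem();         // no CRC layer

            Path p = new Path("/tmp/mapred/local/spill0.out");
            byte[] buf = new byte[4096];

            FSDataInputStream in = local.open(p);
            in.read(buf); // verified against the .crc sidecar
            in.close();

            in = raw.open(p);
            in.read(buf); // bytes as-is, no verification
            in.close();
          }
        }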

    Any information you provide would greatly help confirm the above hypothesis, so it's much appreciated!

    Arun
  • Per Jacobsson at Oct 1, 2008 at 7:05 pm
    I've collected the syslogs from the failed reduce jobs. What's the best way
    to get them to you? Let me know if you need anything else, I'll have to shut
    down these instances some time later today.

    Overall I've run this same job before with no problems. The only change is
    the added gzip of the output. Don't know if it's worth anything, but the
    four failures all happened on different machines. I'll be running this job
    plenty of times so if the problem keeps happening it will be obvious.
    / Per
  • Arun C Murthy at Oct 1, 2008 at 8:36 pm

    On Oct 1, 2008, at 12:04 PM, Per Jacobsson wrote:

    I've collected the syslogs from the failed reduce jobs. What's the best way to get them to you? Let me know if you need anything else, I'll have to shut down these instances some time later today.
    Could you please attach them to the jira: http://issues.apache.org/jira/browse/HADOOP-3647?
    Thanks!

    Arun
    Overall I've run this same job before with no problems. The only change is the added gzip of the output. Don't know if it's worth anything, but the four failures all happened on different machines. I'll be running this job plenty of times so if the problem keeps happening it will be obvious.
    / Per
    With 0.18 we rewrote the path from the output of the map, through the shuffle, to the merge on the reducer. So there could be a bug in that code - again, we hope http://issues.apache.org/jira/browse/HADOOP-4277 will fix this.
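    For background, the "Negative value-length" check fires while decoding the record framing of the intermediate merge files: each record is a vint key length, a vint value length, then the raw key and value bytes, so a single corrupted byte can decode as a negative length. A simplified sketch of that framing check (not the exact IFile code):

        // Simplified model of the intermediate-file record framing whose
        // length check produces "Rec# N: Negative value-length". This is
        // a sketch, not the actual IFile.Reader implementation.
        import java.io.DataInputStream;
        import java.io.IOException;

        import org.apache.hadoop.io.WritableUtils;

        public class RecordFramingCheck {
          // Returns false on a clean end-of-stream marker, throws on corruption.
          static boolean nextRecord(DataInputStream in, int recNo) throws IOException {
            int keyLength = WritableUtils.readVInt(in);
            int valueLength = WritableUtils.readVInt(in);
            if (keyLength == -1 && valueLength == -1) {
              return false; // end-of-stream marker
            }
            if (keyLength < 0) {
              throw new IOException("Rec# " + recNo + ": Negative key-length: " + keyLength);
            }
            if (valueLength < 0) {
              throw new IOException("Rec# " + recNo + ": Negative value-length: " + valueLength);
            }
            // Skip over the key and value payloads.
            in.skipBytes(keyLength + valueLength);
            return true;
          }
        }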

    Arun
  • Per Jacobsson at Oct 1, 2008 at 8:45 pm
    Attached to the ticket. Hope this helps.
    / Per
  • Per Jacobsson at Oct 2, 2008 at 10:55 pm
    Quick FYI: I've run the same job twice more without seeing the error.
    / Per
