Could not obtain block: blk_-2634319951074439134_1129
file=/user/root/crawl_debug/segments/20080825053518/content/part-00002/data

I have the same problem on our cluster.

It seems the reducer tasks are using all CPU long before there's anything to
shuffle.

I started a profile of the reduce task; I've attached the profiling output.
It seems from the samples that ramManager.waitForDataToMerge() doesn't
actually wait.
Has anybody seen this behavior?
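
One common way to capture this kind of task profile on a 0.18-era cluster,
for anyone who wants to reproduce it, is to pass HPROF options to the child
task JVMs via mapred.child.java.opts in hadoop-site.xml. This is only a
sketch of that general approach, not necessarily the exact setup used here;
the heap size and sampling options are examples, and @taskid@ (if your
version interpolates it) is replaced with the task attempt id:

  <property>
    <name>mapred.child.java.opts</name>
    <value>-Xmx512m -agentlib:hprof=cpu=samples,depth=8,interval=10,file=/tmp/@taskid@.hprof</value>
  </property>

The resulting /tmp/*.hprof files on the tasktracker nodes can then be
inspected for hot methods.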

Espen
On Thursday 28 August 2008 06:11:42 wangxu wrote:
Hi all,
I am using hadoop-0.18.0-core.jar and nutch-2008-08-18_04-01-55.jar,
and running Hadoop on one namenode and 4 slaves.
Attached is my hadoop-site.xml; I didn't change the file
hadoop-default.xml.

When data in the segments is large, this kind of error occurs:

java.io.IOException: Could not obtain block: blk_-2634319951074439134_1129 file=/user/root/crawl_debug/segments/20080825053518/content/part-00002/data
        at org.apache.hadoop.dfs.DFSClient$DFSInputStream.chooseDataNode(DFSClient.java:1462)
        at org.apache.hadoop.dfs.DFSClient$DFSInputStream.blockSeekTo(DFSClient.java:1312)
        at org.apache.hadoop.dfs.DFSClient$DFSInputStream.read(DFSClient.java:1417)
        at java.io.DataInputStream.readFully(DataInputStream.java:178)
        at org.apache.hadoop.io.DataOutputBuffer$Buffer.write(DataOutputBuffer.java:64)
        at org.apache.hadoop.io.DataOutputBuffer.write(DataOutputBuffer.java:102)
        at org.apache.hadoop.io.SequenceFile$Reader.readBuffer(SequenceFile.java:1646)
        at org.apache.hadoop.io.SequenceFile$Reader.seekToCurrentValue(SequenceFile.java:1712)
        at org.apache.hadoop.io.SequenceFile$Reader.getCurrentValue(SequenceFile.java:1787)
        at org.apache.hadoop.mapred.SequenceFileRecordReader.getCurrentValue(SequenceFileRecordReader.java:104)
        at org.apache.hadoop.mapred.SequenceFileRecordReader.next(SequenceFileRecordReader.java:79)
        at org.apache.hadoop.mapred.join.WrappedRecordReader.next(WrappedRecordReader.java:112)
        at org.apache.hadoop.mapred.join.WrappedRecordReader.accept(WrappedRecordReader.java:130)
        at org.apache.hadoop.mapred.join.CompositeRecordReader.fillJoinCollector(CompositeRecordReader.java:398)
        at org.apache.hadoop.mapred.join.JoinRecordReader.next(JoinRecordReader.java:56)
        at org.apache.hadoop.mapred.join.JoinRecordReader.next(JoinRecordReader.java:33)
        at org.apache.hadoop.mapred.MapTask$TrackedRecordReader.next(MapTask.java:165)
        at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:45)
        at org.apache.hadoop.mapred.MapTask.run(MapTask.java:227)
        at org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:2209)


How can I correct this?
Thanks.
Xu


  • Devaraj Das at Sep 4, 2008 at 1:34 pm

    I started a profile of the reduce task; I've attached the profiling output.
    It seems from the samples that ramManager.waitForDataToMerge() doesn't
    actually wait.
    Has anybody seen this behavior?
    This has been fixed in HADOOP-3940.

  • Espen Amble Kolstad at Sep 5, 2008 at 9:35 am
    Hi,

    Thanks!
    The patch applies without change to hadoop-0.18.0, and should be
    included in 0.18.1.

    However, I'm still seeing:
    in hadoop.log:
    2008-09-05 11:13:54,805 WARN dfs.DFSClient - Exception while reading from
    blk_3428404120239503595_2664 of
    /user/trank/segments/20080905102650/crawl_generate/part-00010 from
    somehost:50010: java.io.IOException: Premeture EOF from inputStream

    in datanode.log:
    2008-09-05 11:15:09,554 WARN dfs.DataNode -
    DatanodeRegistration(somehost:50010,
    storageID=DS-751763840-somehost-50010-1219931304453, infoPort=50075,
    ipcPort=50020): Got exception while serving blk_-4682098638573619471_2662
    to /somehost:
    java.net.SocketTimeoutException: 480000 millis timeout while waiting for
    channel to be ready for write. ch :
    java.nio.channels.SocketChannel[connected local=/somehost:50010
    remote=/somehost:45244]

    These entries in datanode.log appear a few minutes apart, repeatedly.
    I've reduced the number of map tasks so the load on this node is below 1.0,
    with 5 GB of free memory (so it's not resource starvation).
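
    For anyone wanting to cap map tasks per node the same way: the per-node
    slot count is controlled by mapred.tasktracker.map.tasks.maximum in
    hadoop-site.xml on each tasktracker (a minimal sketch; the value 2 is only
    an example, and the change takes effect after restarting the tasktracker):

      <property>
        <name>mapred.tasktracker.map.tasks.maximum</name>
        <value>2</value>
      </property>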

    Espen
  • Devaraj Das at Sep 6, 2008 at 2:01 pm
    These exceptions are apparently coming from the dfs side of things. Could
    someone from the dfs side please look at these?

  • Dhruba Borthakur at Sep 7, 2008 at 7:43 am
    The DFS errors might have been caused by

    http://issues.apache.org/jira/browse/HADOOP-4040

    thanks,
    dhruba
  • Espen Amble Kolstad at Sep 8, 2008 at 9:25 am
    Hi,

    Thanks for the tip!

    I tried revision 692572 of the 0.18 branch, but I still get the same errors.
  • Espen Amble Kolstad at Sep 8, 2008 at 10:40 am
    There's a JIRA on this already:
    https://issues.apache.org/jira/browse/HADOOP-3831
    Setting dfs.datanode.socket.write.timeout=0 in hadoop-site.xml seems
    to do the trick for now.
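
    In hadoop-site.xml the workaround looks like this (a minimal sketch; a
    value of 0 disables the datanode's socket write timeout entirely, so treat
    it as a temporary workaround rather than a permanent setting):

      <property>
        <name>dfs.datanode.socket.write.timeout</name>
        <value>0</value>
      </property>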

    Espen
  • Stefan Will at Sep 9, 2008 at 11:18 pm
    I'm not sure whether this is the same issue or not, but on my 4-slave
    cluster, setting the dfs.datanode.socket.write.timeout parameter mentioned
    above doesn't seem to fix the issue.

    What I'm seeing is that occasionally datanodes stop responding for up to 10
    minutes at a time. In this case, the TaskTrackers will mark the nodes as
    dead, and occasionally the namenode will mark them as dead as well (you can
    see the "Last Contact" time steadily increase for a random node every half
    hour or so).

    This seems to be happening during times of high disk utilization.

    -- Stefan


  • Raghu Angadi at Sep 9, 2008 at 11:43 pm

    Espen Amble Kolstad wrote:
    There's a JIRA on this already:
    https://issues.apache.org/jira/browse/HADOOP-3831
    Setting dfs.datanode.socket.write.timeout=0 in hadoop-site.xml seems
    to do the trick for now.
    Please comment on HADOOP-3831 that you are seeing this error, so that
    it gets committed. Did you try the patch for HADOOP-3831?

    thanks,
    Raghu.
  • Stefan Will at Sep 11, 2008 at 12:22 am
    I'll add a comment to the JIRA. I haven't tried the latest version of the
    patch yet, but since it only changes the DFS client, not the datanode, I
    don't see how it would help with this.

    Two more things I noticed that happen when the datanodes become
    unresponsive (i.e. the "Last Contact" field on the namenode keeps
    increasing) are:

    1. The datanode process seems to be completely hung for a while, including
    its Jetty web interface, sometimes for over 10 minutes.

    2. The task tracker on the same machine keeps humming along, sending
    regular heartbeats.

    To me this looks like there is some sort of temporary deadlock in the
    datanode that keeps it from responding to requests. Perhaps it's the block
    report being generated?

    -- Stefan
  • Raghu Angadi at Sep 11, 2008 at 1:51 am
    Thanks Stefan.

    What you are seeing is fixed in HADOOP-3232. It is different from the main
    problems reported in this thread. Please try 0.18.1 and see how it works.

    Raghu.

  • Chris Douglas at Sep 6, 2008 at 9:33 pm
    FWIW: HADOOP-3940 is merged into the 0.18 branch and should be part of
    0.18.1. -C

Discussion Overview
group: common-user@hadoop.apache.org
categories: hadoop
posted: Sep 4, '08 at 1:07p
active: Sep 11, '08 at 1:51a
posts: 12
users: 6
website: hadoop.apache.org...
irc: #hadoop
