"Could not get block locations. Aborting..." exception
Hey all.

We've been running into a very annoying problem pretty frequently
lately. We'll be running some job, for instance a distcp, and it'll
be moving along quite nicely until, all of a sudden, it freezes up.
After a while, we get an error like this one:

attempt_200809261607_0003_m_000002_0: Exception closing file /tmp/dustin/input/input_dataunits/_distcp_tmp_1dk90o/part-01897.bucketfile
attempt_200809261607_0003_m_000002_0: java.io.IOException: Could not get block locations. Aborting...
attempt_200809261607_0003_m_000002_0: at org.apache.hadoop.dfs.DFSClient$DFSOutputStream.processDatanodeError(DFSClient.java:2143)
attempt_200809261607_0003_m_000002_0: at org.apache.hadoop.dfs.DFSClient$DFSOutputStream.access$1400(DFSClient.java:1735)
attempt_200809261607_0003_m_000002_0: at org.apache.hadoop.dfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:1889)

At approximately the same time, we start seeing lots of these errors
in the namenode log:

2008-09-26 16:19:26,502 WARN org.apache.hadoop.dfs.StateChange: DIR* NameSystem.startFile: failed to create file /tmp/dustin/input/input_dataunits/_distcp_tmp_1dk90o/part-01897.bucketfile for DFSClient_attempt_200809261607_0003_m_000002_1 on client 10.100.11.83 because current leaseholder is trying to recreate file.
2008-09-26 16:19:26,502 INFO org.apache.hadoop.ipc.Server: IPC Server handler 8 on 7276, call create(/tmp/dustin/input/input_dataunits/_distcp_tmp_1dk90o/part-01897.bucketfile, rwxr-xr-x, DFSClient_attempt_200809261607_0003_m_000002_1, true, 3, 67108864) from 10.100.11.83:60056: error: org.apache.hadoop.dfs.AlreadyBeingCreatedException: failed to create file /tmp/dustin/input/input_dataunits/_distcp_tmp_1dk90o/part-01897.bucketfile for DFSClient_attempt_200809261607_0003_m_000002_1 on client 10.100.11.83 because current leaseholder is trying to recreate file.
org.apache.hadoop.dfs.AlreadyBeingCreatedException: failed to create file /tmp/dustin/input/input_dataunits/_distcp_tmp_1dk90o/part-01897.bucketfile for DFSClient_attempt_200809261607_0003_m_000002_1 on client 10.100.11.83 because current leaseholder is trying to recreate file.
    at org.apache.hadoop.dfs.FSNamesystem.startFileInternal(FSNamesystem.java:952)
    at org.apache.hadoop.dfs.FSNamesystem.startFile(FSNamesystem.java:903)
    at org.apache.hadoop.dfs.NameNode.create(NameNode.java:284)
    at sun.reflect.GeneratedMethodAccessor8.invoke(Unknown Source)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
    at java.lang.reflect.Method.invoke(Method.java:597)
    at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:452)
    at org.apache.hadoop.ipc.Server$Handler.run(Server.java:888)



Eventually, the job fails because of these errors. Subsequent job
runs also experience this problem and fail. The only way we've been
able to recover is to restart the DFS. It doesn't happen every time,
but it does happen often enough that I'm worried.

Does anyone have any ideas as to why this might be happening? I
thought that https://issues.apache.org/jira/browse/HADOOP-2669 might
be the culprit, but today we upgraded to hadoop 0.18.1 and the
problem still happens.

Thanks,

Bryan
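
One way to gather more context on a failure like this is to trace the attempt
id and the file name through the logs. A minimal sketch, assuming a stock
0.18-style log layout under $HADOOP_LOG_DIR; the file names and paths below
are illustrative, not from this thread:

    ATTEMPT=attempt_200809261607_0003_m_000002_0
    FILE=part-01897.bucketfile
    # what the namenode recorded for this file (lease / create errors)
    grep "$FILE" "$HADOOP_LOG_DIR"/hadoop-*-namenode-*.log | tail -20
    # the task's own logs live under userlogs on the tasktracker that ran it
    tail -40 "$HADOOP_LOG_DIR"/userlogs/"$ATTEMPT"/syslog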

  • Hairong Kuang at Sep 26, 2008 at 11:50 pm
    Does your failed map task open a lot of files to write? Could you please check the log of the datanode running at the machine where the map tasks failed? Do you see any error message containing "exceeds the limit of concurrent xcievers"?

    Hairong
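
    If that message does show up, the usual 0.18-era workaround was to raise
    the datanode's xceiver limit. A rough sketch of the check; the log path
    and the property name (dfs.datanode.max.xcievers, with its historical
    misspelling) are assumptions worth verifying against hadoop-default.xml:

      # on each datanode, look for the xceiver-limit message Hairong mentions
      grep -c "exceeds the limit of concurrent xcievers" \
          "$HADOOP_LOG_DIR"/hadoop-*-datanode-*.log
      # if it shows up, raise the limit in hadoop-site.xml and restart the
      # datanodes (the value here is illustrative):
      #   <property>
      #     <name>dfs.datanode.max.xcievers</name>
      #     <value>2048</value>
      #   </property>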

  • Bryan Duxbury at Sep 27, 2008 at 4:53 am
    Well, I did find some more errors in the datanode log. Here's a
    sampling:

    2008-09-26 10:43:57,287 ERROR org.apache.hadoop.dfs.DataNode: DatanodeRegistration(10.100.11.115:50010, storageID=DS-1784982905-10.100.11.115-50010-1221785192226, infoPort=50075, ipcPort=50020):DataXceiver: java.io.IOException: Block blk_-3923611845661840838_176295 is not valid.
        at org.apache.hadoop.dfs.FSDataset.getBlockFile(FSDataset.java:716)
        at org.apache.hadoop.dfs.FSDataset.getLength(FSDataset.java:704)
        at org.apache.hadoop.dfs.DataNode$BlockSender.<init>(DataNode.java:1678)
        at org.apache.hadoop.dfs.DataNode$DataXceiver.readBlock(DataNode.java:1101)
        at org.apache.hadoop.dfs.DataNode$DataXceiver.run(DataNode.java:1037)

    2008-09-26 10:56:19,325 ERROR org.apache.hadoop.dfs.DataNode: DatanodeRegistration(10.100.11.115:50010, storageID=DS-1784982905-10.100.11.115-50010-1221785192226, infoPort=50075, ipcPort=50020):DataXceiver: java.io.EOFException: while trying to read 65557 bytes
        at org.apache.hadoop.dfs.DataNode$BlockReceiver.readToBuf(DataNode.java:2464)
        at org.apache.hadoop.dfs.DataNode$BlockReceiver.readNextPacket(DataNode.java:2508)
        at org.apache.hadoop.dfs.DataNode$BlockReceiver.receivePacket(DataNode.java:2572)
        at org.apache.hadoop.dfs.DataNode$BlockReceiver.receiveBlock(DataNode.java:2698)
        at org.apache.hadoop.dfs.DataNode$DataXceiver.writeBlock(DataNode.java:1283)

    2008-09-26 10:56:19,779 ERROR org.apache.hadoop.dfs.DataNode: DatanodeRegistration(10.100.11.115:50010, storageID=DS-1784982905-10.100.11.115-50010-1221785192226, infoPort=50075, ipcPort=50020):DataXceiver: java.io.EOFException
        at java.io.DataInputStream.readShort(DataInputStream.java:298)
        at org.apache.hadoop.dfs.DataNode$DataXceiver.run(DataNode.java:1021)
        at java.lang.Thread.run(Thread.java:619)

    2008-09-26 10:56:21,816 ERROR org.apache.hadoop.dfs.DataNode: DatanodeRegistration(10.100.11.115:50010, storageID=DS-1784982905-10.100.11.115-50010-1221785192226, infoPort=50075, ipcPort=50020):DataXceiver: java.io.IOException: Could not read from stream
        at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:119)
        at java.io.DataInputStream.readByte(DataInputStream.java:248)
        at org.apache.hadoop.io.WritableUtils.readVLong(WritableUtils.java:324)
        at org.apache.hadoop.io.WritableUtils.readVInt(WritableUtils.java:345)
        at org.apache.hadoop.io.Text.readString(Text.java:410)

    2008-09-26 10:56:28,380 ERROR org.apache.hadoop.dfs.DataNode: DatanodeRegistration(10.100.11.115:50010, storageID=DS-1784982905-10.100.11.115-50010-1221785192226, infoPort=50075, ipcPort=50020):DataXceiver: java.io.IOException: Connection reset by peer
        at sun.nio.ch.FileDispatcher.read0(Native Method)
        at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:21)
        at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:233)
        at sun.nio.ch.IOUtil.read(IOUtil.java:206)
        at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:236)

    2008-09-26 10:56:52,387 ERROR org.apache.hadoop.dfs.DataNode: DatanodeRegistration(10.100.11.115:50010, storageID=DS-1784982905-10.100.11.115-50010-1221785192226, infoPort=50075, ipcPort=50020):DataXceiver: java.io.IOException: Too many open files
        at sun.nio.ch.EPollArrayWrapper.epollCreate(Native Method)
        at sun.nio.ch.EPollArrayWrapper.<init>(EPollArrayWrapper.java:59)
        at sun.nio.ch.EPollSelectorImpl.<init>(EPollSelectorImpl.java:52)
        at sun.nio.ch.EPollSelectorProvider.openSelector(EPollSelectorProvider.java:18)
        at sun.nio.ch.Util.getTemporarySelector(Util.java:123)

    The most interesting one in my eyes is the too many open files one.
    My ulimit is 1024. How much should it be? I don't think that I have
    that many files open in my mappers. They should only be operating on
    a single file at a time. I can try to run the job again and get an
    lsof if it would be interesting.

    Thanks for taking the time to reply, by the way.

    -Bryan
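
    A quick way to see how close each Hadoop JVM actually gets to the 1024
    descriptor cap is to count open fds per process. A minimal sketch,
    assuming a Linux box with jps on the PATH; the process names and the
    DataNode focus are assumptions, not from the original thread:

      # count open file descriptors for every local Hadoop JVM
      for pid in $(jps | awk '$2 != "Jps" {print $1}'); do
          printf '%-8s %s\n' "$pid" "$(ls /proc/"$pid"/fd 2>/dev/null | wc -l)"
      done
      # or zoom in on one suspect process, e.g. the DataNode, with lsof
      DN_PID=$(jps | awk '/DataNode/ {print $1}')
      lsof -p "$DN_PID" | wc -l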

    On Sep 26, 2008, at 4:48 PM, Hairong Kuang wrote:

    Does your failed map task open a lot of files to write? Could you
    please check the log of the datanode running at the machine where
    the map tasks failed? Do you see any error message containing
    "exceeds the limit of concurrent xcievers"?

    Hairong

  • Raghu Angadi at Sep 29, 2008 at 5:21 pm

    The most interesting one in my eyes is the too many open files one. My
    ulimit is 1024. How much should it be? I don't think that I have that
    many files open in my mappers. They should only be operating on a single
    file at a time. I can try to run the job again and get an lsof if it
    would be interesting.

    Thanks for taking the time to reply, by the way.

    For the current implementation, you need around 3x fds. 1024 is too low
    for Hadoop. The Hadoop requirement will come down, but 1024 would be too
    low anyway.

    Raghu.
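
    For reference, one common way to raise the limit past 1024 for the
    Hadoop daemons; a sketch that assumes the daemons run as a user named
    "hadoop" (the user name and the 16384 value are placeholders):

      # as root: raise the open-file limit for the user running the daemons
      echo 'hadoop soft nofile 16384' >> /etc/security/limits.conf
      echo 'hadoop hard nofile 16384' >> /etc/security/limits.conf
      # restart the daemons from a fresh login and confirm the new limit
      su - hadoop -c 'ulimit -n'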
  • Doug Cutting at Sep 29, 2008 at 5:33 pm

    Raghu Angadi wrote:

    For the current implementation, you need around 3x fds. 1024 is too low
    for Hadoop. The Hadoop requirement will come down, but 1024 would be too
    low anyway.

    1024 is the default on many systems. Shouldn't we try to make the
    default configuration work well there? If not, we should document this
    prominently in the "Cluster Setup" documentation.

    Doug
  • Raghu Angadi at Sep 29, 2008 at 5:39 pm

    Doug Cutting wrote:

    Raghu Angadi wrote:
    For the current implementation, you need around 3x fds. 1024 is too
    low for Hadoop. The Hadoop requirement will come down, but 1024 would
    be too low anyway.

    1024 is the default on many systems. Shouldn't we try to make the
    default configuration work well there?

    How can 1024 work well for different kinds of loads?

    If not, we should document this prominently in the "Cluster Setup"
    documentation.

    Yes, certainly.

    Raghu.
  • Raghu Angadi at Sep 29, 2008 at 5:42 pm

    Raghu Angadi wrote:

    Doug Cutting wrote:
    Raghu Angadi wrote:
    For the current implementation, you need around 3x fds. 1024 is too
    low for Hadoop. The Hadoop requirement will come down, but 1024 would
    be too low anyway.

    1024 is the default on many systems. Shouldn't we try to make the
    default configuration work well there?

    How can 1024 work well for different kinds of loads?

    oops! 1024 should work for anyone "working with just one file" for any
    load. I didn't notice that. My comment can be ignored.

    Raghu.
  • Bryan Duxbury at Sep 30, 2008 at 1:48 am
    Ok, so, what might I do next to try and diagnose this? Does it sound
    like it might be an HDFS/mapreduce bug, or should I pore over my own
    code first?

    Also, did any of the other exceptions look interesting?

    -Bryan
  • Raghu Angadi at Sep 30, 2008 at 3:24 am

    Bryan Duxbury wrote:

    Ok, so, what might I do next to try and diagnose this? Does it sound
    like it might be an HDFS/mapreduce bug, or should I pore over my own
    code first?

    Also, did any of the other exceptions look interesting?

    The exceptions closest to the failure time would be most important...
    which are those? You can start from there.

    At a given time, check /proc/<pid>/fd (on Linux) to verify you actually
    have only one block open. Most likely that is not the case. Then you can
    try to figure out how many blocks are open (from the Datanode log), then
    what files they belong to, etc.

    hope this helps.
    Raghu.
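
    To put numbers on that suggestion, a rough sketch for the datanode side,
    assuming Linux and jps on the PATH; it relies on HDFS block files being
    named blk_<id> on disk, and the rest of the details are illustrative:

      DN_PID=$(jps | awk '/DataNode/ {print $1}')
      # block files currently held open by the datanode (names contain blk_)
      ls -l /proc/"$DN_PID"/fd | grep -c 'blk_'
      # sockets held open (each active DataXceiver uses at least one)
      ls -l /proc/"$DN_PID"/fd | grep -c 'socket:'
      # list the block files themselves to see which blocks they are
      ls -l /proc/"$DN_PID"/fd | grep 'blk_'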
  • Raghu Angadi at Oct 5, 2008 at 6:25 am
    https://issues.apache.org/jira/browse/HADOOP-4346 might explain this.

    Raghu.

  • Raghu Angadi at Sep 29, 2008 at 5:40 pm

    Raghu Angadi wrote:

    The most interesting one in my eyes is the too many open files one. My
    ulimit is 1024. How much should it be? I don't think that I have that
    many files open in my mappers. They should only be operating on a
    single file at a time. I can try to run the job again and get an lsof
    if it would be interesting.

    oops! my comment does not apply to your case with just a single file
    open. You must be hitting some other bug. Please disregard my comment.

    Raghu.

Discussion Overview
group: common-user
categories: hadoop
posted: Sep 26, '08 at 11:36p
active: Oct 5, '08 at 6:25a
posts: 11
users: 4
website: hadoop.apache.org...
irc: #hadoop
