FAQ
While I was copying files to hdfs, the hadoop fs client started to
report errors. Digging into the datanode logs revealed [1] that I had
run out of space on one of my datanodes. The namenode (running on the
same machine as the failed datanode) died with a fatal error [2] when
this happened and the logs seem to indicate some kind of corruption. I
am unable to start up my namenode now due to the current state of hdfs
[3].

I stumbled upon HDFS-1378 which implies that manual editing of edit
logs must be done to recover from this. How would one go about doing
this? Are there any other options? Is this expected to happen when a
datanode runs out of space during a copy? I'm not against wiping clean
the data directories of each datanode and reformatting the namenode,
if necessary.
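
From what I've been able to gather, one possible recovery path (assuming a
SecondaryNameNode was checkpointing and whatever fs.checkpoint.dir points at
survived) would be to move the damaged name directory aside and let the
namenode rebuild from the last checkpoint, roughly along these lines (the
paths are just placeholders for my layout, not real ones):

  # stop the namenode first, then:
  mv /data/dfs/name /data/dfs/name.broken   # the damaged dfs.name.dir (placeholder path)
  mkdir /data/dfs/name                      # empty dir for the rebuilt metadata
  hadoop namenode -importCheckpoint         # load the secondary's last checkpoint

Edits made after that last checkpoint would presumably be lost, which is part
of why I'm not ruling out a full reformat.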

One other part of this scenario that I can't explain is why data was
being written to this node in the first place. This machine was not
listed in the slaves file yet it was still being treated as a
datanode. I realize now that the datanode daemon should not have been
started on this machine but I would imagine that it would be ignored
by the client if it was not in the configuration.

I'm running CDH3b2.

Thanks,
Patrick


[1] datanode log when space ran out:

2010-10-06 10:30:22,995 INFO
org.apache.hadoop.hdfs.server.datanode.DataNode: Receiving block
blk_-5413202144274811562_223793 src: /128.115.210.46:34712 dest:
/128.115.210.46:50010
2010-10-06 10:30:23,599 WARN
org.apache.hadoop.hdfs.server.datanode.DataNode: checkDiskError:
exception:
java.io.IOException: No space left on device
at java.io.FileOutputStream.writeBytes(Native Method)
at java.io.FileOutputStream.write(FileOutputStream.java:260)
at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receivePacket(BlockReceiver.java:453)
at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receiveBlock(BlockReceiver.java:532)
at org.apache.hadoop.hdfs.server.datanode.DataXceiver.writeBlock(DataXceiver.java:377)
at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:118)
2010-10-06 10:30:23,617 INFO
org.apache.hadoop.hdfs.server.datanode.DataNode: Exception in
receiveBlock for block blk_-5413202144274811562_223793
org.apache.hadoop.util.DiskChecker$DiskOutOfSpaceException: No space
left on device

[2] namenode log after space ran out:

2010-10-06 10:31:03,675 ERROR
org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Unable to sync
edit log. Fatal Error.
2010-10-06 10:31:03,675 FATAL
org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Fatal Error : All
storage directories are inaccessible.

[3] namenode log error during startup:
2010-10-06 10:46:35,889 ERROR
org.apache.hadoop.hdfs.server.namenode.FSNamesystem: FSNamesystem
initialization failed.
java.io.IOException: Incorrect data format. logVersion is -18 but
writables.length is 0.
at org.apache.hadoop.hdfs.server.namenode.FSEditLog.loadFSEdits(FSEditLog.java:556)
....

  • Allen Wittenauer at Oct 6, 2010 at 9:04 pm
    Given this is the third time this has come up in the past two days, I guess we need a new FAQ entry or three.

    We also clearly need to update the quickstart to say:

    a) Do not run a datanode on the namenode.
    b) Make sure dfs.name.dir has two entries, one on a remote box (see the
       sketch below).
    c) The slaves file has nothing to do with which nodes are in HDFS.
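
    As a hedged illustration of point (b): dfs.name.dir takes a
    comma-separated list of directories, and the namenode writes its image
    and edit log to every one of them, so filling up or losing a single
    volume does not destroy the only copy of the metadata. The paths below
    are placeholders, with the second entry assumed to be an NFS mount from
    another machine:

    <!-- fragment of hdfs-site.xml; paths are illustrative -->
    <property>
      <name>dfs.name.dir</name>
      <!-- one directory on local disk, one on a remote NFS mount -->
      <value>/data/1/dfs/name,/mnt/remote-nfs/dfs/name</value>
    </property>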

  • Shrijeet Paliwal at Oct 6, 2010 at 9:07 pm
    > One other part of this scenario that I can't explain is why data was
    > being written to this node in the first place. This machine was not
    > listed in the slaves file yet it was still being treated as a
    > datanode.

    Doesn't matter whether it was listed in the slaves file. Was a datanode
    running on that node?

    > I realize now that the datanode daemon should not have been
    > started on this machine but I would imagine that it would be ignored
    > by the client if it was not in the configuration.

    Oh yes, it was running. It's not ignored if it's not mentioned in the
    slaves file.

    Dig into the hdfs-user mails sent last week and this week. A couple of
    similar issues were reported, and they have a solution.
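
    The slaves file is only used by the start/stop scripts to decide where
    to launch daemons; any datanode that starts up and can reach the
    namenode will register and receive blocks. If you want the namenode
    itself to refuse unknown datanodes, the usual mechanism is an include
    file wired up via dfs.hosts. A minimal sketch, assuming an include file
    at /etc/hadoop/conf/dfs.hosts listing one allowed hostname per line
    (path and contents are illustrative):

    <!-- fragment of hdfs-site.xml; path is illustrative -->
    <property>
      <name>dfs.hosts</name>
      <value>/etc/hadoop/conf/dfs.hosts</value>
    </property>

    After editing the include file, hadoop dfsadmin -refreshNodes makes the
    namenode re-read it without a restart.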
