Please help! Corrupt fsimage?
Hi,

After a restart of our live cluster today, the name node fails to start with
the log message seen below. There is a big file called edits.new in the
"current" folder that seems to be the only one that has received changes
recently (no changes to edits or fsimage for over a month). Is that
normal?

The last change to the edits.new file was right before shutting down the
cluster. It seems like the shutdown was unable to store valid fsimage,
edits, and edits.new files. The secondary name node's image does not include
an edits.new file, only edits and fsimage, which are identical to the name
node's versions. So no help from there.

Would appreciate any help in understanding what could have gone wrong. The
shutdown seemed to complete just fine, without any error message. Is there
any way to recreate the image from the data, or any other way to save our
production data?

Sincerely,
Peter

2010-07-07 14:30:26,949 INFO org.apache.hadoop.ipc.metrics.RpcMetrics: Initializing RPC Metrics with hostName=NameNode, port=9000
2010-07-07 14:30:26,960 INFO org.apache.hadoop.metrics.jvm.JvmMetrics: Initializing JVM Metrics with processName=NameNode, sessionId=null
2010-07-07 14:30:27,019 DEBUG org.apache.hadoop.security.UserGroupInformation: Unix Login: hbase,hbase
2010-07-07 14:30:27,149 ERROR org.apache.hadoop.hdfs.server.namenode.FSNamesystem: FSNamesystem initialization failed.
java.io.EOFException
        at java.io.DataInputStream.readShort(DataInputStream.java:298)
        at org.apache.hadoop.hdfs.server.namenode.FSImage.loadFSImage(FSImage.java:881)
        at org.apache.hadoop.hdfs.server.namenode.FSImage.loadFSImage(FSImage.java:807)
        at org.apache.hadoop.hdfs.server.namenode.FSImage.recoverTransitionRead(FSImage.java:364)
        at org.apache.hadoop.hdfs.server.namenode.FSDirectory.loadFSImage(FSDirectory.java:87)
        at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.initialize(FSNamesystem.java:311)
        at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.<init>(FSNamesystem.java:292)
        at org.apache.hadoop.hdfs.server.namenode.NameNode.initialize(NameNode.java:201)
        at org.apache.hadoop.hdfs.server.namenode.NameNode.<init>(NameNode.java:279)
        at org.apache.hadoop.hdfs.server.namenode.NameNode.createNameNode(NameNode.java:956)
        at org.apache.hadoop.hdfs.server.namenode.NameNode.main(NameNode.java:965)
2010-07-07 14:30:27,150 INFO org.apache.hadoop.ipc.Server: Stopping server on 9000
2010-07-07 14:30:27,151 ERROR org.apache.hadoop.hdfs.server.namenode.NameNode: java.io.EOFException
        at java.io.DataInputStream.readShort(DataInputStream.java:298)
        at org.apache.hadoop.hdfs.server.namenode.FSImage.loadFSImage(FSImage.java:881)
        at org.apache.hadoop.hdfs.server.namenode.FSImage.loadFSImage(FSImage.java:807)
        at org.apache.hadoop.hdfs.server.namenode.FSImage.recoverTransitionRead(FSImage.java:364)
        at org.apache.hadoop.hdfs.server.namenode.FSDirectory.loadFSImage(FSDirectory.java:87)
        at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.initialize(FSNamesystem.java:311)
        at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.<init>(FSNamesystem.java:292)
        at org.apache.hadoop.hdfs.server.namenode.NameNode.initialize(NameNode.java:201)
        at org.apache.hadoop.hdfs.server.namenode.NameNode.<init>(NameNode.java:279)
        at org.apache.hadoop.hdfs.server.namenode.NameNode.createNameNode(NameNode.java:956)
        at org.apache.hadoop.hdfs.server.namenode.NameNode.main(NameNode.java:965)
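
The EOFException above comes from DataInputStream.readShort hitting end-of-file: the image loader expects more bytes than the (apparently truncated) fsimage actually contains. A minimal sketch that reproduces the same exception outside Hadoop, using a one-byte file in place of the truncated image:

    import java.io.*;

    public class TruncatedImageDemo {
        public static void main(String[] args) throws IOException {
            File img = File.createTempFile("fsimage", null);
            img.deleteOnExit();
            try (FileOutputStream out = new FileOutputStream(img)) {
                out.write(0); // a single byte: less than the 2 bytes readShort needs
            }
            try (DataInputStream in = new DataInputStream(new FileInputStream(img))) {
                in.readShort(); // throws java.io.EOFException, as in the log above
            }
        }
    }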

  • Peter Falk at Jul 7, 2010 at 2:35 pm
    Just a little update. We found a working fsimage that was just a couple of
    days older than the corrupt one. We tried to replace the corrupt fsimage with
    the working one, keeping the edits and edits.new files, hoping that the latest
    edits would still be applied. However, when starting the namenode, the
    following error message appears. Any thoughts, ideas, or hints on how to
    continue? Edit the edits files somehow?

    TIA,
    Peter

    2010-07-07 16:21:10,312 INFO org.apache.hadoop.hdfs.server.common.Storage: Number of files = 28372
    2010-07-07 16:21:11,162 INFO org.apache.hadoop.hdfs.server.common.Storage: Number of files under construction = 8
    2010-07-07 16:21:11,164 INFO org.apache.hadoop.hdfs.server.common.Storage: Image file of size 3315887 loaded in 0 seconds.
    2010-07-07 16:21:11,164 DEBUG org.apache.hadoop.hdfs.server.namenode.FSNamesystem: 9: /hbase/.logs/miller,60020,1274447474064/hlog.dat.1274706452423 numblocks : 1 clientHolder clientMachine
    2010-07-07 16:21:11,164 DEBUG org.apache.hadoop.hdfs.StateChange: DIR* FSDirectory.unprotectedDelete: failed to remove /hbase/.logs/miller,60020,1274447474064/hlog.dat.1274706452423 because it does not exist
    2010-07-07 16:21:11,164 ERROR org.apache.hadoop.hdfs.server.namenode.NameNode: java.lang.NullPointerException
            at org.apache.hadoop.hdfs.server.namenode.FSDirectory.addChild(FSDirectory.java:1006)
            at org.apache.hadoop.hdfs.server.namenode.FSDirectory.addNode(FSDirectory.java:982)
            at org.apache.hadoop.hdfs.server.namenode.FSDirectory.unprotectedAddFile(FSDirectory.java:194)
            at org.apache.hadoop.hdfs.server.namenode.FSEditLog.loadFSEdits(FSEditLog.java:615)
            at org.apache.hadoop.hdfs.server.namenode.FSImage.loadFSEdits(FSImage.java:992)
            at org.apache.hadoop.hdfs.server.namenode.FSImage.loadFSImage(FSImage.java:812)
            at org.apache.hadoop.hdfs.server.namenode.FSImage.recoverTransitionRead(FSImage.java:364)
            at org.apache.hadoop.hdfs.server.namenode.FSDirectory.loadFSImage(FSDirectory.java:87)
            at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.initialize(FSNamesystem.java:311)
            at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.<init>(FSNamesystem.java:292)
            at org.apache.hadoop.hdfs.server.namenode.NameNode.initialize(NameNode.java:201)
            at org.apache.hadoop.hdfs.server.namenode.NameNode.<init>(NameNode.java:279)
            at org.apache.hadoop.hdfs.server.namenode.NameNode.createNameNode(NameNode.java:956)
            at org.apache.hadoop.hdfs.server.namenode.NameNode.main(NameNode.java:965)

    2010-07-07 16:21:11,165 INFO org.apache.hadoop.hdfs.server.namenode.NameNode: SHUTDOWN_MSG:
    /************************************************************
    SHUTDOWN_MSG: Shutting down NameNode at fanta/192.168.10.53
    ************************************************************/

  • Alex Loddengaard at Jul 7, 2010 at 3:09 pm
    Hi Peter,

    The edits.new file is used while the edits and fsimage are pulled to the
    secondarynamenode. Here's the process (a toy simulation follows the list):

    1) SNN pulls edits and fsimage
    2) NN starts writing edits to edits.new
    3) SNN sends new fsimage to NN
    4) NN replaces its fsimage with the SNN fsimage
    5) NN replaces edits with edits.new
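
    A toy simulation of that file dance, sketched with plain java.nio rather than
    real HDFS code (file names mirror the thread; the "merge" is faked by
    concatenation), makes the failure window visible: if the NN never completes
    steps 4-5 again, edits.new is the only file that keeps changing, which matches
    what Peter observed.

        import java.io.IOException;
        import java.nio.file.*;

        public class CheckpointDance {
            public static void main(String[] args) throws IOException {
                Path dir = Files.createTempDirectory("name-current");
                Path fsimage = Files.write(dir.resolve("fsimage"), "old image".getBytes());
                Path edits = Files.write(dir.resolve("edits"), "old ops".getBytes());
                Path editsNew = dir.resolve("edits.new");

                // 1) SNN pulls edits and fsimage (copies, not moves)
                Path snnImage = Files.copy(fsimage, dir.resolve("snn.fsimage"));
                Path snnEdits = Files.copy(edits, dir.resolve("snn.edits"));
                // 2) NN starts writing incoming edits to edits.new
                Files.write(editsNew, "ops arriving during checkpoint".getBytes());
                // 3+4) SNN merges image+edits; NN installs the result as its fsimage
                byte[] merged = (new String(Files.readAllBytes(snnImage)) + "+"
                        + new String(Files.readAllBytes(snnEdits))).getBytes();
                Files.write(fsimage, merged);
                // 5) NN replaces edits with edits.new
                Files.move(editsNew, edits, StandardCopyOption.REPLACE_EXISTING);
            }
        }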

    Certainly taking a different fsimage and trying to apply newer edits to it won't
    work. Your best bet might be to take the 3-day-old fsimage with an empty
    edits and delete edits.new. But before you do any of this, make sure you
    completely back up every directory listed in dfs.name.dir and dfs.checkpoint.dir.
    What are the timestamps on the fsimage files in each dfs.name.dir and
    dfs.checkpoint.dir?
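
    For the timestamp question, a small sketch like this (directories are passed as
    arguments; point it at the current/ folder under each dfs.name.dir and
    dfs.checkpoint.dir) prints the modification times:

        import java.io.File;
        import java.util.Date;

        public class ImageTimestamps {
            public static void main(String[] args) {
                for (String dir : args) {
                    for (String name : new String[] {"fsimage", "edits", "edits.new"}) {
                        File f = new File(dir, name);
                        System.out.println(f + " -> "
                                + (f.exists() ? new Date(f.lastModified()) : "missing"));
                    }
                }
            }
        }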

    Do the namenode and secondarynamenode have enough disk space? Have you
    consulted the logs to learn why the SNN/NN didn't properly update the
    fsimage and edits log?

    Hope this helps.

    Alex
    On Wed, Jul 7, 2010 at 7:34 AM, Peter Falk wrote:

    Just a little update. We found a working fsimage that was just a couple of
    days older than the corrupt one. We tried to replace the fsimage with the
    working one, and kept the edits and edits.new files, hoping the the latest
    edits would be still in use. However, when starting the namenode, the
    following error message appears. Any thought ideas or hints of how to
    continue? Edit the edits files somehow?

    TIA,
    Peter

    2010-07-07 16:21:10,312 INFO org.apache.hadoop.hdfs.server.common.Storage:
    Number of files = 28372
    2010-07-07 16:21:11,162 INFO org.apache.hadoop.hdfs.server.common.Storage:
    Number of files under construction = 8
    2010-07-07 16:21:11,164 INFO org.apache.hadoop.hdfs.server.common.Storage:
    Image file of size 3315887 loaded in 0 seconds.
    2010-07-07 16:21:11,164 DEBUG
    org.apache.hadoop.hdfs.server.namenode.FSNamesystem: 9:
    /hbase/.logs/miller,60020,1274447474064/hlog.dat.1274706452423 numblocks :
    1
    clientHolder clientMachine
    2010-07-07 16:21:11,164 DEBUG org.apache.hadoop.hdfs.StateChange: DIR*
    FSDirectory.unprotectedDelete: failed to remove
    /hbase/.logs/miller,60020,1274447474064/hlog.dat.1274706452423 because it
    does not exist
    2010-07-07 16:21:11,164 ERROR
    org.apache.hadoop.hdfs.server.namenode.NameNode:
    java.lang.NullPointerException
    at

    org.apache.hadoop.hdfs.server.namenode.FSDirectory.addChild(FSDirectory.java:1006)
    at

    org.apache.hadoop.hdfs.server.namenode.FSDirectory.addNode(FSDirectory.java:982)
    at

    org.apache.hadoop.hdfs.server.namenode.FSDirectory.unprotectedAddFile(FSDirectory.java:194)
    at

    org.apache.hadoop.hdfs.server.namenode.FSEditLog.loadFSEdits(FSEditLog.java:615)
    at

    org.apache.hadoop.hdfs.server.namenode.FSImage.loadFSEdits(FSImage.java:992)
    at

    org.apache.hadoop.hdfs.server.namenode.FSImage.loadFSImage(FSImage.java:812)
    at

    org.apache.hadoop.hdfs.server.namenode.FSImage.recoverTransitionRead(FSImage.java:364)
    at

    org.apache.hadoop.hdfs.server.namenode.FSDirectory.loadFSImage(FSDirectory.java:87)
    at

    org.apache.hadoop.hdfs.server.namenode.FSNamesystem.initialize(FSNamesystem.java:311)
    at

    org.apache.hadoop.hdfs.server.namenode.FSNamesystem.<init>(FSNamesystem.java:292)
    at

    org.apache.hadoop.hdfs.server.namenode.NameNode.initialize(NameNode.java:201)
    at
    org.apache.hadoop.hdfs.server.namenode.NameNode.<init>(NameNode.java:279)
    at

    org.apache.hadoop.hdfs.server.namenode.NameNode.createNameNode(NameNode.java:956)
    at
    org.apache.hadoop.hdfs.server.namenode.NameNode.main(NameNode.java:965)

    2010-07-07 16:21:11,165 INFO
    org.apache.hadoop.hdfs.server.namenode.NameNode: SHUTDOWN_MSG:
    /************************************************************
    SHUTDOWN_MSG: Shutting down NameNode at fanta/192.168.10.53
    ************************************************************/

    On Wed, Jul 7, 2010 at 14:46, Peter Falk wrote:

    Hi,

    After a restart of our live cluster today, the name node fails to start
    with the log message seen below. There is a big file called edits.new in the
    "current" folder that seems be the only one that have received changes
    recently (no changes to the edits or the fsimage for over a month). Is that
    normal?

    The last change to the edits.new file was right before shutting down the
    cluster. It seems like the shutdown was unable to store valid fsimage,
    edits, edits.new files. The secondary name node image does not include the
    edits.new file, only edits and fsimage, which are identical to the name
    nodes version. So no help from them.

    Would appreciate any help in understanding what could have gone wrong. The
    shutdown seemed to complete just fine, without any error message. Is there
    any way to recreate the image from the data, or any other way to save our
    production data?

    Sincerely,
    Peter

    2010-07-07 14:30:26,949 INFO org.apache.hadoop.ipc.metrics.RpcMetrics:
    Initializing RPC Metrics with hostName=NameNode, port=9000
    2010-07-07 14:30:26,960 INFO org.apache.hadoop.metrics.jvm.JvmMetrics:
    Initializing JVM Metrics with processName=NameNode, sessionId=null
    2010-07-07 14:30:27,019 DEBUG
    org.apache.hadoop.security.UserGroupInformation: Unix Login: hbase,hbase
    2010-07-07 14:30:27,149 ERROR
    org.apache.hadoop.hdfs.server.namenode.FSNamesystem: FSNamesystem
    initialization failed.
    java.io.EOFException
    at java.io.DataInputStream.readShort(DataInputStream.java:298)
    at
    org.apache.hadoop.hdfs.server.namenode.FSImage.loadFSImage(FSImage.java:881)
    at
    org.apache.hadoop.hdfs.server.namenode.FSImage.loadFSImage(FSImage.java:807)
    at
    org.apache.hadoop.hdfs.server.namenode.FSImage.recoverTransitionRead(FSImage.java:364)
    at
    org.apache.hadoop.hdfs.server.namenode.FSDirectory.loadFSImage(FSDirectory.java:87)
    at
    org.apache.hadoop.hdfs.server.namenode.FSNamesystem.initialize(FSNamesystem.java:311)
    at
    org.apache.hadoop.hdfs.server.namenode.FSNamesystem.<init>(FSNamesystem.java:292)
    at
    org.apache.hadoop.hdfs.server.namenode.NameNode.initialize(NameNode.java:201)
    at
    org.apache.hadoop.hdfs.server.namenode.NameNode.<init>(NameNode.java:279)
    at
    org.apache.hadoop.hdfs.server.namenode.NameNode.createNameNode(NameNode.java:956)
    at
    org.apache.hadoop.hdfs.server.namenode.NameNode.main(NameNode.java:965)
    2010-07-07 14:30:27,150 INFO org.apache.hadoop.ipc.Server: Stopping server
    on 9000
    2010-07-07 14:30:27,151 ERROR
    org.apache.hadoop.hdfs.server.namenode.NameNode: java.io.EOFException
    at java.io.DataInputStream.readShort(DataInputStream.java:298)
    at
    org.apache.hadoop.hdfs.server.namenode.FSImage.loadFSImage(FSImage.java:881)
    at
    org.apache.hadoop.hdfs.server.namenode.FSImage.loadFSImage(FSImage.java:807)
    at
    org.apache.hadoop.hdfs.server.namenode.FSImage.recoverTransitionRead(FSImage.java:364)
    at
    org.apache.hadoop.hdfs.server.namenode.FSDirectory.loadFSImage(FSDirectory.java:87)
    at
    org.apache.hadoop.hdfs.server.namenode.FSNamesystem.initialize(FSNamesystem.java:311)
    at
    org.apache.hadoop.hdfs.server.namenode.FSNamesystem.<init>(FSNamesystem.java:292)
    at
    org.apache.hadoop.hdfs.server.namenode.NameNode.initialize(NameNode.java:201)
    at
    org.apache.hadoop.hdfs.server.namenode.NameNode.<init>(NameNode.java:279)
    at
    org.apache.hadoop.hdfs.server.namenode.NameNode.createNameNode(NameNode.java:956)
    at
    org.apache.hadoop.hdfs.server.namenode.NameNode.main(NameNode.java:965
  • Jean-Daniel Cryans at Jul 7, 2010 at 3:33 pm
    What Alex said; also, having been through that issue before, this really looks
    like https://issues.apache.org/jira/browse/HDFS-1024.

    J-D
  • Peter Falk at Jul 7, 2010 at 6:05 pm
    Thanks for the information, Alex and Jean-Daniel! We have finally been able to
    get the namenode to start, after patching the source code according to the
    attached patch. It is based on the HDFS-1002 patch, but modified and
    extended to fix additional NPEs. It is made for Hadoop 0.20.1.

    There seemed to be some corrupt edits and/or some files missing from the fsimage
    that caused NPEs during startup while merging the edits into the fsimage. Hope
    that the attached patch may be of some use for people in similar situations.
    We have not run an fsck yet; we are waiting for a raw copy of the datanode data
    to complete first. Let's hope that not too much was lost...
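
    The patch itself is attached to the original mail and not reproduced here. The
    NPE in FSDirectory.addChild is the pattern where a replayed edit references a
    path whose parent does not exist in the older image; a toy illustration of the
    kind of null-guard such a patch adds (hypothetical, not Peter's actual code):

        import java.util.*;

        public class SkipBadEdits {
            // toy namespace: the set of directories present in the old image
            static final Set<String> dirs = new HashSet<>(Arrays.asList("/", "/hbase"));

            // guarded "add file under its parent": instead of dereferencing a
            // missing parent (the NullPointerException), log and skip the edit
            static void addFile(String path) {
                String parent = path.substring(0, Math.max(path.lastIndexOf('/'), 1));
                if (!dirs.contains(parent)) {
                    System.err.println("skipping edit, missing parent: " + path);
                    return;
                }
                System.out.println("applied: " + path);
            }

            public static void main(String[] args) {
                addFile("/hbase/table1");         // parent exists: applied
                addFile("/hbase/.logs/hlog.dat"); // parent missing: skipped, no NPE
            }
        }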

    Sincerely,
    Peter
  • Peter Falk at Jul 7, 2010 at 8:26 pm
    FYI, just a small update. After starting the data nodes, the block reporting
    ratio was only 68%, and the name node never left safe mode. Apparently, too
    many edits were lost. We have resorted to formatting the cluster for now; we
    have backups of the most essential data and have started restoring it.
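
    For context on why the name node never left safe mode: in 0.20 it waits until
    the fraction of blocks reported by the datanodes reaches
    dfs.safemode.threshold.pct (default 0.999), so 68% is far below the bar. As a
    worked check:

        public class SafeModeCheck {
            public static void main(String[] args) {
                double reported = 0.68;    // fraction of blocks reported by datanodes
                double threshold = 0.999;  // dfs.safemode.threshold.pct default
                System.out.println(reported >= threshold
                        ? "leaving safe mode"
                        : "staying in safe mode: " + reported + " < " + threshold);
            }
        }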

    Of course, this data loss is very disappointing. We have kept copies of the
    datanode data, as well as the corrupt fsimage and edits. If anyone has any
    idea of how to restore the data, either by better merging the edits or by
    reconstructing the fsimage from the datanode data somehow, please let me
    know!

    Time to get some sleep now, it has been a long day...

    Sincerely,
    Peter
  • Michael Segel at Jul 9, 2010 at 2:39 pm
    I know this is a little late in the game...

    You could have forced the cluster out of safe mode and then used fsck to copy the bad blocks out to the file system. (See the help on fsck.)

    While that might not have helped recover lost data, it would have gotten your cloud back.
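
    From the shell that would be "hadoop dfsadmin -safemode leave" followed by
    "hadoop fsck / -move"; the same two steps against the 0.20-era Java API might
    look like this (a sketch, assuming the cluster configuration is on the
    classpath):

        import org.apache.hadoop.conf.Configuration;
        import org.apache.hadoop.fs.FileSystem;
        import org.apache.hadoop.hdfs.DistributedFileSystem;
        import org.apache.hadoop.hdfs.protocol.FSConstants;
        import org.apache.hadoop.hdfs.tools.DFSck;
        import org.apache.hadoop.util.ToolRunner;

        public class ForceRecover {
            public static void main(String[] args) throws Exception {
                Configuration conf = new Configuration();
                // leave safe mode even though not all blocks were reported
                DistributedFileSystem dfs = (DistributedFileSystem) FileSystem.get(conf);
                dfs.setSafeMode(FSConstants.SafeModeAction.SAFEMODE_LEAVE);
                // fsck -move relocates files with missing blocks to /lost+found
                ToolRunner.run(new DFSck(conf), new String[] { "/", "-move" });
            }
        }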

    I would also find out where most of the corruption occurred. It sounds like you may have a bad disk.

    HTH

    -Mike

