We finally managed to get the namenode to start, after patching the source
code according to the attached patch. It is based on the HDFS-1002 patch,
but modified and extended to fix additional NPEs. It is made for Hadoop 0.20.1.
There seemed to be some corrupt edits and/or some missing files in the fsimage
that caused NPEs during startup, while merging the edits into the fsimage.
Hopefully the attached patch is of some use to people in similar situations.
We have not run an fsck yet; we are waiting for a raw copy of the datanode
data to complete first. Let's hope that not too much was lost...
Sincerely,
Peter
On Wed, Jul 7, 2010 at 17:31, Jean-Daniel Cryans wrote:
What Alex said, and also it really looks like
https://issues.apache.org/jira/browse/HDFS-1024 from having the experience
of that issue.
J-D
java.lang.NullPointerException
        at org.apache.hadoop.hdfs.server.namenode.FSDirectory.addChild(FSDirectory.java:1006)
        at org.apache.hadoop.hdfs.server.namenode.FSDirectory.addNode(FSDirectory.java:982)
        at org.apache.hadoop.hdfs.server.namenode.FSDirectory.unprotectedAddFile(FSDirectory.java:194)
        at org.apache.hadoop.hdfs.server.namenode.FSEditLog.loadFSEdits(FSEditLog.java:615)
        at org.apache.hadoop.hdfs.server.namenode.FSImage.loadFSEdits(FSImage.java:992)
        at org.apache.hadoop.hdfs.server.namenode.FSImage.loadFSImage(FSImage.java:812)
        at org.apache.hadoop.hdfs.server.namenode.FSImage.recoverTransitionRead(FSImage.java:364)
        at org.apache.hadoop.hdfs.server.namenode.FSDirectory.loadFSImage(FSDirectory.java:87)
        at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.initialize(FSNamesystem.java:311)
        at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.<init>(FSNamesystem.java:292)
        at org.apache.hadoop.hdfs.server.namenode.NameNode.initialize(NameNode.java:201)
        at org.apache.hadoop.hdfs.server.namenode.NameNode.<init>(NameNode.java:279)
        at org.apache.hadoop.hdfs.server.namenode.NameNode.createNameNode(NameNode.java:956)
java.io.EOFException
        at java.io.DataInputStream.readShort(DataInputStream.java:298)
        at org.apache.hadoop.hdfs.server.namenode.FSImage.loadFSImage(FSImage.java:881)
        at org.apache.hadoop.hdfs.server.namenode.FSImage.loadFSImage(FSImage.java:807)
        at org.apache.hadoop.hdfs.server.namenode.FSImage.recoverTransitionRead(FSImage.java:364)
        at org.apache.hadoop.hdfs.server.namenode.FSDirectory.loadFSImage(FSDirectory.java:87)
        at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.initialize(FSNamesystem.java:311)
        at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.<init>(FSNamesystem.java:292)
        at org.apache.hadoop.hdfs.server.namenode.NameNode.initialize(NameNode.java:201)
        at org.apache.hadoop.hdfs.server.namenode.NameNode.<init>(NameNode.java:279)
        at org.apache.hadoop.hdfs.server.namenode.NameNode.createNameNode(NameNode.java:956)
        at org.apache.hadoop.hdfs.server.namenode.NameNode.main(NameNode.java:965)
On Wed, Jul 7, 2010 at 8:07 AM, Alex Loddengaard wrote:
Hi Peter,
The edits.new file is used when the edits and fsimage are pulled to the
secondarynamenode. Here's the process:
1) SNN pulls edits and fsimage
2) NN starts writing edits to edits.new
3) SNN sends new fsimage to NN
4) NN replaces its fsimage with the SNN fsimage
5) NN replaces edits with edits.new
Certainly taking a different fsimage and trying to apply edits to it won't
work. Your best bet might be to take the 3-day-old fsimage with an empty
edits file and delete edits.new. But before you do any of this, make sure you
completely back up every directory listed in dfs.name.dir and
dfs.checkpoint.dir. What are the timestamps on the fsimage files in each
dfs.name.dir and dfs.checkpoint.dir?
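[Editor's note: the "back everything up first" step can be scripted. A minimal sketch; the paths below are placeholders created in a scratch directory for illustration. In real use, substitute the actual dfs.name.dir and dfs.checkpoint.dir values from hdfs-site.xml and stop the NameNode first.]

```shell
# Demo runs in a scratch directory; in real use NAME_DIR and
# CHECKPOINT_DIR are the dfs.name.dir / dfs.checkpoint.dir paths
# from hdfs-site.xml, and the NameNode must be stopped first.
WORK=$(mktemp -d)
NAME_DIR="$WORK/dfs/name"
CHECKPOINT_DIR="$WORK/dfs/namesecondary"
mkdir -p "$NAME_DIR/current" "$CHECKPOINT_DIR/current"

STAMP=$(date +%Y%m%d-%H%M%S)

# cp -a preserves timestamps and permissions, so the backups can be
# compared against the originals later.
cp -a "$NAME_DIR" "$NAME_DIR.bak-$STAMP"
cp -a "$CHECKPOINT_DIR" "$CHECKPOINT_DIR.bak-$STAMP"

# Compare fsimage/edits timestamps before deciding which image to keep.
ls -l "$NAME_DIR.bak-$STAMP/current" "$CHECKPOINT_DIR.bak-$STAMP/current"
```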
Do the namenode and secondarynamenode have enough disk space? Have you
consulted the logs to learn why the SNN/NN didn't properly update the
fsimage and edits log?
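[Editor's note: both checks are quick from a shell on each node. A sketch; the log directory below is an assumed default, adjust it to your HADOOP_LOG_DIR.]

```shell
# Free space on the current partition; run inside each dfs.name.dir
# and dfs.checkpoint.dir to check the NN/SNN disks.
df -h .

# Scan NameNode/SecondaryNameNode logs for checkpoint failures.
# /var/log/hadoop is an assumed default; adjust to your HADOOP_LOG_DIR.
LOG_DIR=${HADOOP_LOG_DIR:-/var/log/hadoop}
grep -iE 'error|exception' "$LOG_DIR"/*namenode*.log* 2>/dev/null | tail -20 || true
```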
Hope this helps.
Alex
On Wed, Jul 7, 2010 at 7:34 AM, Peter Falk wrote:
Just a little update. We found a working fsimage that was just a couple of
days older than the corrupt one. We tried to replace the fsimage with
working one, and kept the edits and edits.new files, hoping that the latest
edits would still be in use. However, when starting the namenode, the
following error message appears. Any thoughts, ideas, or hints on how to
continue? Edit the edits files somehow?
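[Editor's note: hand-editing a binary edit log is risky, but it can first help to check whether edits.new simply ends mid-record. A sketch; the default path is only an example, point EDITS at your real <dfs.name.dir>/current/edits.new.]

```shell
# Dump the tail of an edit log to see whether it was truncated mid-record.
# The default path is only an example; set EDITS to your real
# <dfs.name.dir>/current/edits.new.
EDITS=${EDITS:-/data/hadoop/dfs/name/current/edits.new}
if [ -f "$EDITS" ]; then
  ls -l "$EDITS"                      # size and last-modified time
  od -A d -t x1 "$EDITS" | tail -5    # last bytes, hex with decimal offsets
else
  echo "no such file: $EDITS (set EDITS to your edits.new path)"
fi
```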
TIA,
Peter
2010-07-07 16:21:10,312 INFO org.apache.hadoop.hdfs.server.common.Storage: Number of files = 28372
2010-07-07 16:21:11,162 INFO org.apache.hadoop.hdfs.server.common.Storage: Number of files under construction = 8
2010-07-07 16:21:11,164 INFO org.apache.hadoop.hdfs.server.common.Storage: Image file of size 3315887 loaded in 0 seconds.
2010-07-07 16:21:11,164 DEBUG org.apache.hadoop.hdfs.server.namenode.FSNamesystem: 9: /hbase/.logs/miller,60020,1274447474064/hlog.dat.1274706452423
2010-07-07 16:21:11,164 DEBUG org.apache.hadoop.hdfs.StateChange: DIR* FSDirectory.unprotectedDelete: failed to remove /hbase/.logs/miller,60020,1274447474064/hlog.dat.1274706452423 because it does not exist
2010-07-07 16:21:11,164 ERROR org.apache.hadoop.hdfs.server.namenode.NameNode:
java.lang.NullPointerException
        at org.apache.hadoop.hdfs.server.namenode.FSDirectory.addChild(FSDirectory.java:1006)
        at org.apache.hadoop.hdfs.server.namenode.FSDirectory.addNode(FSDirectory.java:982)
        at org.apache.hadoop.hdfs.server.namenode.FSDirectory.unprotectedAddFile(FSDirectory.java:194)
        at org.apache.hadoop.hdfs.server.namenode.FSEditLog.loadFSEdits(FSEditLog.java:615)
        at org.apache.hadoop.hdfs.server.namenode.FSImage.loadFSEdits(FSImage.java:992)
        at org.apache.hadoop.hdfs.server.namenode.FSImage.loadFSImage(FSImage.java:812)
        at org.apache.hadoop.hdfs.server.namenode.FSImage.recoverTransitionRead(FSImage.java:364)
        at org.apache.hadoop.hdfs.server.namenode.FSDirectory.loadFSImage(FSDirectory.java:87)
        at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.initialize(FSNamesystem.java:311)
        at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.<init>(FSNamesystem.java:292)
        at org.apache.hadoop.hdfs.server.namenode.NameNode.initialize(NameNode.java:201)
        at org.apache.hadoop.hdfs.server.namenode.NameNode.<init>(NameNode.java:279)
        at org.apache.hadoop.hdfs.server.namenode.NameNode.createNameNode(NameNode.java:956)
        at org.apache.hadoop.hdfs.server.namenode.NameNode.main(NameNode.java:965)
2010-07-07 16:21:11,165 INFO
org.apache.hadoop.hdfs.server.namenode.NameNode: SHUTDOWN_MSG:
/************************************************************
SHUTDOWN_MSG: Shutting down NameNode at fanta/192.168.10.53
************************************************************/
On Wed, Jul 7, 2010 at 14:46, Peter Falk wrote:
Hi,
After a restart of our live cluster today, the name node fails to
start with the log message seen below. There is a big file called edits.new
in the "current" folder that seems to be the only one that has received
changes recently (no changes to the edits or the fsimage for over a month).
Is that normal?
The last change to the edits.new file was right before shutting down the
cluster. It seems like the shutdown was unable to store valid fsimage,
edits, and edits.new files. The secondary name node image does not include
the edits.new file, only edits and fsimage, which are identical to the name
node's versions. So no help from there.
Would appreciate any help in understanding what could have gone wrong. The
shutdown seemed to complete just fine, without any error message. Is there
any way to recreate the image from the data, or any other way to save our
production data?
Sincerely,
Peter
2010-07-07 14:30:26,949 INFO org.apache.hadoop.ipc.metrics.RpcMetrics: Initializing RPC Metrics with hostName=NameNode, port=9000
2010-07-07 14:30:26,960 INFO org.apache.hadoop.metrics.jvm.JvmMetrics: Initializing JVM Metrics with processName=NameNode, sessionId=null
2010-07-07 14:30:27,019 DEBUG org.apache.hadoop.security.UserGroupInformation: Unix Login: hbase,hbase
2010-07-07 14:30:27,149 ERROR org.apache.hadoop.hdfs.server.namenode.FSNamesystem: FSNamesystem
initialization failed.
java.io.EOFException
        at java.io.DataInputStream.readShort(DataInputStream.java:298)
        at org.apache.hadoop.hdfs.server.namenode.FSImage.loadFSImage(FSImage.java:881)
        at org.apache.hadoop.hdfs.server.namenode.FSImage.loadFSImage(FSImage.java:807)
        at org.apache.hadoop.hdfs.server.namenode.FSImage.recoverTransitionRead(FSImage.java:364)
        at org.apache.hadoop.hdfs.server.namenode.FSDirectory.loadFSImage(FSDirectory.java:87)
        at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.initialize(FSNamesystem.java:311)
        at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.<init>(FSNamesystem.java:292)
        at org.apache.hadoop.hdfs.server.namenode.NameNode.initialize(NameNode.java:201)
        at org.apache.hadoop.hdfs.server.namenode.NameNode.<init>(NameNode.java:279)
        at org.apache.hadoop.hdfs.server.namenode.NameNode.createNameNode(NameNode.java:956)
2010-07-07 14:30:27,150 INFO org.apache.hadoop.ipc.Server: Stopping server on 9000
2010-07-07 14:30:27,151 ERROR org.apache.hadoop.hdfs.server.namenode.NameNode: java.io.EOFException
        at java.io.DataInputStream.readShort(DataInputStream.java:298)
        at org.apache.hadoop.hdfs.server.namenode.FSImage.loadFSImage(FSImage.java:881)
        at org.apache.hadoop.hdfs.server.namenode.FSImage.loadFSImage(FSImage.java:807)
        at org.apache.hadoop.hdfs.server.namenode.FSImage.recoverTransitionRead(FSImage.java:364)
        at org.apache.hadoop.hdfs.server.namenode.FSDirectory.loadFSImage(FSDirectory.java:87)
        at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.initialize(FSNamesystem.java:311)
        at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.<init>(FSNamesystem.java:292)
        at org.apache.hadoop.hdfs.server.namenode.NameNode.initialize(NameNode.java:201)
        at org.apache.hadoop.hdfs.server.namenode.NameNode.<init>(NameNode.java:279)
        at org.apache.hadoop.hdfs.server.namenode.NameNode.createNameNode(NameNode.java:956)
        at org.apache.hadoop.hdfs.server.namenode.NameNode.main(NameNode.java:965)