Just in case someone's curious.
On stopping and restarting DFS with 0.13.1:
- the master namenode says:
2007-08-24 18:31:27,318 INFO org.apache.hadoop.dfs.NameNode: Namenode up at: hadoop001.sf2p.facebook.com/10.16.159.101:9000
2007-08-24 18:31:28,560 WARN org.apache.hadoop.dfs.StateChange: DIR* FSDirectory.unprotectedDelete: failed to remove /tmp/pu3 because it does not exist
2007-08-24 18:31:28,571 WARN org.apache.hadoop.dfs.StateChange: DIR* FSDirectory.unprotectedRenameTo: failed to rename /user/facebook/chatter/rawcounts/2007-08-04/_task_0001_r_000044_0/part-00044 to /user/facebook/chatter/rawcounts/2007-08-04/part-00044 because destination exists
2007-08-24 18:31:28,571 WARN org.apache.hadoop.dfs.StateChange: DIR* FSDirectory.unprotectedRenameTo: failed to rename /user/facebook/chatter/rawcounts/2007-08-04/_task_0001_r_000044_0/.part-00044.crc to /user/facebook/chatter/rawcounts/2007-08-04/.part-00044.crc because destination exists
2007-08-24 18:31:28,572 WARN org.apache.hadoop.dfs.StateChange: DIR* FSDirectory.unprotectedRenameTo: failed to rename /user/facebook/chatter/rawcounts/2007-08-04/_task_0001_r_000040_0/part-00040 to /user/facebook/chatter/rawcounts/2007-08-04/part-00040 because destination exists
2007-08-24 18:31:28,572 WARN org.apache.hadoop.dfs.StateChange: DIR* FSDirectory.unprotectedRenameTo: failed to rename /user/facebook/chatter/rawcounts/2007-08-04/_task_0001_r_000040_0/.part-00040.crc to /user/facebook/chatter/rawcounts/2007-08-04/.part-00040.crc because destination exists
2007-08-24 18:31:28,573 WARN org.apache.hadoop.dfs.StateChange: DIR* FSDirectory.unprotectedRenameTo: failed to rename /user/facebook/chatter/rawcounts/2007-08-04/_task_0001_r_000052_0/part-00052 to /user/facebook/chatter/rawcounts/2007-08-04/part-00052 because destination exists
...
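A side note on what those rename WARNs mean, as I understand it (I haven't checked the 0.13 source, so treat this as an assumption): the edit log is replaying renames that had already been applied before the image was saved, and HDFS refuses to rename onto an existing destination rather than overwrite it. The same condition is visible from the plain client API; a rough Java sketch, reusing one of the paths from the log above:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class RenameIfAbsent {
  public static void main(String[] args) throws Exception {
    // Default DFS taken from the client config on the classpath.
    FileSystem fs = FileSystem.get(new Configuration());
    Path src = new Path("/user/facebook/chatter/rawcounts/2007-08-04/_task_0001_r_000044_0/part-00044");
    Path dst = new Path("/user/facebook/chatter/rawcounts/2007-08-04/part-00044");
    // rename() does not overwrite: when dst already exists it simply fails,
    // which is the same "destination exists" condition the namenode logs above.
    if (fs.exists(dst)) {
      System.out.println("destination exists, leaving it alone");
    } else if (!fs.rename(src, dst)) {
      System.out.println("rename failed");
    }
  }
}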
There's a serious blast of these rename warnings (replaying the edit log?). In any case, after this is done it enters safe mode; I presume the filesystem is corrupted by then. At the exact same time, the datanodes are busy deleting blocks:
2007-08-24 18:31:33,243 INFO org.apache.hadoop.dfs.DataNode: Starting DataNode in: FSDataset{dirpath='/var/hadoop/tmp/dfs/data/current'}
2007-08-24 18:31:33,243 INFO org.apache.hadoop.dfs.DataNode: using BLOCKREPORT_INTERVAL of 3588023msec
2007-08-24 18:31:34,252 INFO org.apache.hadoop.dfs.DataNode: Deleting block blk_-9223045762536565560 file /var/hadoop/tmp/dfs/data/current/subdir14/subdir18/blk_-9223045762536565560
2007-08-24 18:31:34,269 INFO org.apache.hadoop.dfs.DataNode: Deleting block blk_-9214178286744587840 file /var/hadoop/tmp/dfs/data/current/subdir14/subdir12/blk_-9214178286744587840
2007-08-24 18:31:34,370 INFO org.apache.hadoop.dfs.DataNode: Deleting block blk_-9213127144044535407 file /var/hadoop/tmp/dfs/data/current/subdir14/subdir20/blk_-9213127144044535407
2007-08-24 18:31:34,386 INFO org.apache.hadoop.dfs.DataNode: Deleting block blk_-9211625398030978419 file /var/hadoop/tmp/dfs/data/current/subdir14/subdir26/blk_-9211625398030978419
2007-08-24 18:31:34,418 INFO org.apache.hadoop.dfs.DataNode: Deleting block blk_-9189558923884323865 file /var/hadoop/tmp/dfs/data/current/subdir14/subdir24/blk_-9189558923884323865
2007-08-24 18:31:34,419 INFO org.apache.hadoop.dfs.DataNode: Deleting block blk_-9115468136273900585 file /var/hadoop/tmp/dfs/data/current/subdir10/blk_-9115468136273900585
Ouch. I guess those are all the blocks that fsck is now reporting as missing. Known bug? Operator error? (Well, I did do a clean shutdown...)
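In case anyone wants to size the damage without waiting for a full fsck: a crude alternative is to walk the output tree and try to read a byte of each file. A rough Java sketch, written against the FileSystem API as I remember it (not checked against 0.13), with our output directory hard-coded; it only catches files whose first block is gone, so fsck remains the real tool:

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class FindUnreadable {
  // Recursively try to read the first byte of every file under p and
  // print the ones that throw, i.e. whose first block can't be served.
  static void walk(FileSystem fs, Path p) throws IOException {
    for (FileStatus st : fs.listStatus(p)) {
      if (st.isDir()) {
        walk(fs, st.getPath());
        continue;
      }
      try {
        FSDataInputStream in = fs.open(st.getPath());
        if (st.getLen() > 0) {
          in.read();
        }
        in.close();
      } catch (IOException e) {
        System.out.println("unreadable: " + st.getPath());
      }
    }
  }

  public static void main(String[] args) throws IOException {
    FileSystem fs = FileSystem.get(new Configuration());
    walk(fs, new Path("/user/facebook/chatter/rawcounts"));
  }
}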
-----Original Message-----
From: Joydeep Sen Sarma
Sent: Friday, August 24, 2007 7:21 PM
To: hadoop-user@lucene.apache.org
Subject: RE: secondary namenode errors
I wish I had read the bug more carefully; I had thought the issue was fixed in 0.13.1.
Of course not, the issue persists. Meanwhile, half the files are corrupted after the upgrade (I followed the upgrade wiki and tried to restore the backed-up metadata and the old version, to no avail).
Sigh - have a nice weekend everyone,
Joydeep
-----Original Message-----
From: Koji Noguchi
Sent: Friday, August 24, 2007 8:29 AM
To: hadoop-user@lucene.apache.org
Subject: Re: secondary namenode errors
Joydeep,
I think you're hitting this bug.
http://issues.apache.org/jira/browse/HADOOP-1076
In any case, as Raghu suggested, please use 0.13.1 and not 0.13.
Koji
Raghu Angadi wrote:
Joydeep Sen Sarma wrote:
Thanks for replying.

Can you please clarify - is it the case that the secondary namenode
stuff only works in 0.13.1? And what's the connection with
replication?

We lost the file system completely once, trying to make sure we can

I am not sure if the problem you reported still exists in 0.13.1. You
might still have the problem and you can ask again. But you should
move to 0.13.1 since it has some critical fixes. See the release notes
for 0.13.1 or HADOOP-1603. You should always upgrade to the latest
minor release version when moving to the next major version.

Raghu.
-----Original Message-----
From: Raghu Angadi
Sent: Thursday,
To: hadoop-user@lucene.apache.org
Subject: Re: secondary namenode errors

On a related note, please don't use 0.13.0, use the latest released
version for 0.13 (I think it is 0.13.1). If the secondary namenode
actually works, then it will result in all the replications being set to 1.

Joydeep Sen Sarma wrote:
Hi folks,