Aaron T. Myers
On Tue, Mar 26, 2013 at 8:42 AM, Dan Bongert wrote:
Thanks Todd, that did allow the namenode to start up. However, I'm still
baffled why this happened in the first place (and how to prevent it in the
future). Why are there tens of thousands of edit files? Is it because I
either forgot to finalize the metadata after the last upgrade I did, or
that the finalization didn't complete properly? Or something else entirely?
Obviously it's not normal behavior, and I'd like to fix the problem before
it gets worse.
On Mar 25, 2013, at 3:08 AM, Todd Lipcon wrote:
Hi Dan,
Can you try setting dfs.namenode.max.extra.edits.segments.retained to a
very large number, like 1000000?
I think this should work around the bug you're hitting.
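For reference, the property goes in hdfs-site.xml on the NameNode; a minimal sketch of the fragment (value per the suggestion above):

<property>
  <name>dfs.namenode.max.extra.edits.segments.retained</name>
  <value>1000000</value>
</property>

Then start the NN again.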
Thanks
-Todd
On Thu, Mar 21, 2013 at 7:58 AM, Dan Bongert wrote:
Here's the end of the file. It's full of repeated sequences like the ones at the start, and then it decides that it can't start the namenode.
http://pastebin.com/WBMdjbF1
On Mar 21, 2013, at 9:53 AM, Harsh J wrote:
Hi Dan,
Can you pastebin or send along your full crashing NN log to us so we
can take a closer look?
Note that "image/fsimage" is not the valid fsimage. The current/
dir is what you should look at; that is where the right fsimage lives.
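For example (assuming a name dir of /data/2/dfs/nn, as in your listing below):

ls -l /data/2/dfs/nn/current/fsimage_*

The highest-numbered fsimage_<txid> there is the image the NN will actually load.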
On Thu, Mar 21, 2013 at 8:20 PM, Dan Bongert wrote:
I reviewed my records, and Jan 30 was indeed when I upgraded from
CDH3u5 to 4.1.2, so it makes sense that the previous.checkpoint would be
from then.
My first thought was that I didn't finalize the metadata after the
3u5 -> 4.1.2 upgrade, but I'm pretty sure I did that a couple weeks
afterwards. I am not sure if the presence of multiple fsimage files
corroborates that or not:
hadoop1(6) /data/2/dfs/nn/current# ll *fsimage*
-rw-r--r-- 1 hdfs hdfs 1036854 Feb 18 11:40 fsimage_0000000000000075813
-rw-r--r-- 1 hdfs hdfs      62 Feb 18 11:40 fsimage_0000000000000075813.md5
-rw-r--r-- 1 hdfs hdfs 1077298 Mar 20 13:18 fsimage_0000000000000185306
-rw-r--r-- 1 hdfs hdfs      62 Mar 20 13:18 fsimage_0000000000000185306.md5
Of course the real fsimage file is dated Jan 30:
-rw-r--r-- 1 hdfs hdfs 157 Jan 30 13:52 /data/2/dfs/nn/image/fsimage
And the secondary namenode info hasn't been updated since Jan 30 either:
-rw-r--r-- 1 hdfs hdfs 157 Jan 30 13:52 /data/2/dfs/snn/image/fsimage
Realistically, what are my options at this point? I assume that if I
delete the 70k edits_* files, I'm going to lose nearly two months of edits?
Or will the actual data on disk not be changed?
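Whatever I end up doing, I'll take a copy of all three name dirs first, something like (the backup path is just an example):

tar czf /tmp/nn-backup.tar.gz /data/2/dfs/nn /data/3/dfs/nn /data/4/dfs/nn

so there's a way back if it goes badly.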
On Mar 20, 2013, at 2:03 PM, Dan Bongert wrote:
Of course I discovered this after it became a problem, and now I'm
not sure what to do next without losing data/metadata.
Background: I just added two nodes to a CDH 4.1.3 cluster. The new
nodes got installed automatically with 4.2, so I went ahead and upgraded
the other nodes too. Now, after the upgrade, the namenode won't start.
This is from around the end of the logfile:
java.lang.AssertionError: Should not purge more edits than required to restore: 165186 should be <= 75814
The /data/N/dfs/nn/current directories have a TON of files in them
(70,130 in each, replicated across 3 directories), which leads me to
believe the metadata isn't updating properly. The previous.checkpoint
directory is from Jan 30!
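(Counted with something like:

ls /data/2/dfs/nn/current/edits_* | wc -l

and the same for the /data/3 and /data/4 copies.)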
My first thought, before I dug into it further, was that my upgrade
to 4.1.3 (on February 12-13) caused the issue, but it doesn't make sense
that the old data would predate that.
The namenode log indicates it dutifully goes through all the edits:
2013-03-20 13:18:42,836 INFO org.apache.hadoop.hdfs.server.namenode.EditLogInputStream: Fast-forwarding stream '/data/3/dfs/nn/current/edits_0000000000000185306-0000000000000185306' to transaction ID 75814
2013-03-20 13:18:42,837 INFO org.apache.hadoop.hdfs.server.namenode.FSImage: Edits file /data/3/dfs/nn/current/edits_0000000000000185306-0000000000000185306, /data/4/dfs/nn/current/edits_0000000000000185306-0000000000000185306, /data/2/dfs/nn/current/edits_0000000000000185306-0000000000000185306 of size 1048576 edits # 1 loaded in 0 seconds
2013-03-20 13:18:42,888 INFO org.apache.hadoop.hdfs.server.namenode.FSImage: Saving image file /data/3/dfs/nn/current/fsimage.ckpt_0000000000000185306 using no compression
2013-03-20 13:18:42,888 INFO org.apache.hadoop.hdfs.server.namenode.FSImage: Saving image file /data/2/dfs/nn/current/fsimage.ckpt_0000000000000185306 using no compression
2013-03-20 13:18:42,888 INFO org.apache.hadoop.hdfs.server.namenode.FSImage: Saving image file /data/4/dfs/nn/current/fsimage.ckpt_0000000000000185306 using no compression
But then it decides there are too many? And I have no idea where it is
getting the numbers 165,186 or 75,814 (as I said, there are 70,130 edits_
files in the current directory). I don't think any metadata was lost, but
I'm not sure.
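(The filenames do encode transaction IDs, e.g.:

ls /data/2/dfs/nn/current | grep 0000000000000075813
# fsimage_0000000000000075813 -> covers transactions up through 75813

so 75,814 is presumably just 75,813 + 1, the first transaction after that image; where 165,186 comes from I still can't tell.)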
--
Dan Bongert
[email protected]SSCC Linux System Administrator (608) 262-9857
--
Todd Lipcon
Software Engineer, Cloudera
--
Dan Bongert
[email protected]SSCC Linux System Administrator (608) 262-9857