We have corrupt edits and edits.new files. We also did not notice that
our secondary namenode (nn2) had not been functioning for over a
month. Long story short, we now have an fsimage that is a month old, a
small edits file (just the standard hour's worth) and a huge
edits.new. The cause of the corruption is not certain, but it was
probably a full disk at some point.

When we start the namenode from Cloudera Manager, we see the following
in the namenode log:

2012-06-18 16:07:41,452 INFO org.apache.hadoop.hdfs.server.common.Storage: Number of files = 1883781
2012-06-18 16:08:05,669 INFO org.apache.hadoop.hdfs.server.common.Storage: Number of files under construction = 23
2012-06-18 16:08:05,674 INFO org.apache.hadoop.hdfs.server.common.Storage: Image file of size 283050614 loaded in 24 seconds.
2012-06-18 16:08:05,675 WARN org.apache.hadoop.hdfs.StateChange: DIR* FSDirectory.unprotectedSetTimes: failed to setTimes /hbase/.logs/myServerName,60020,1333115929432-splitting/myServerName%3A60020.1337297156177 because source does not exist
... SOME MORE FILES ...
2012-06-18 16:08:05,682 WARN org.apache.hadoop.hdfs.StateChange: DIR* FSDirectory.unprotectedRenameTo: failed to rename /hbase/.logs/myServerName,60020,1333115929432-splitting/myServerName%3A60020.1337221541572 to /hbase/.oldlogs/myServerName%3A60020.1337221541572 because source does not exist
... SOME MORE FILES ...
2012-06-18 16:08:05,707 ERROR org.apache.hadoop.hdfs.server.common.Storage: Error replaying edit log at offset 40548
Recent opcode offsets: 39902 40055 40214 40367
java.lang.NullPointerException
        at org.apache.hadoop.hdfs.server.namenode.FSDirectory.addChild(FSDirectory.java:1099)
        at org.apache.hadoop.hdfs.server.namenode.FSDirectory.addChild(FSDirectory.java:1111)
        at org.apache.hadoop.hdfs.server.namenode.FSDirectory.addNode(FSDirectory.java:1014)
        at org.apache.hadoop.hdfs.server.namenode.FSDirectory.unprotectedAddFile(FSDirectory.java:217)
        at org.apache.hadoop.hdfs.server.namenode.FSEditLog.loadFSEdits(FSEditLog.java:693)
        at org.apache.hadoop.hdfs.server.namenode.FSImage.loadFSEdits(FSImage.java:1043)
        at org.apache.hadoop.hdfs.server.namenode.FSImage.loadFSImage(FSImage.java:853)
        at org.apache.hadoop.hdfs.server.namenode.FSImage.recoverTransitionRead(FSImage.java:383)
        at org.apache.hadoop.hdfs.server.namenode.FSDirectory.loadFSImage(FSDirectory.java:110)
        at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.initialize(FSNamesystem.java:372)
        at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.<init>(FSNamesystem.java:335)
        at org.apache.hadoop.hdfs.server.namenode.NameNode.initialize(NameNode.java:271)
        at org.apache.hadoop.hdfs.server.namenode.NameNode.<init>(NameNode.java:467)
        at org.apache.hadoop.hdfs.server.namenode.NameNode.createNameNode(NameNode.java:1330)
        at org.apache.hadoop.hdfs.server.namenode.NameNode.main(NameNode.java:1339)


Starting the namenode in recovery mode just skips all edits from the
corrupt entry onward. (The corruption is very early in the file, so
recovery like that is definitely not usable.)

So, can we upgrade to CDH4 to use the new recovery functions? Is that
even possible in the current state?

As far as we know, HBase and the other files on HDFS were all healthy
until we restarted HDFS (putting it in the error state it is now in).


  • Colin Patrick McCabe at Jun 18, 2012 at 6:46 pm
    Hi Ferdy,

    This would be a good place to use NameNode recovery. I wrote a blog
    post about it at:

    http://www.cloudera.com/blog/2012/05/namenode-recovery-tools-for-the-hadoop-distributed-file-system/

    You should be careful to back up your fsimage and edit log directories
    before running NameNode recovery, of course.

    sincerely,
    Colin

  • Ferdy Galema at Jun 18, 2012 at 7:09 pm
    Do you suggest we upgrade our entire cluster, or can we just run the
    CDH4 recovery tool binary against the namenode directory? (We would
    rather not upgrade yet.)

    Thanks.
  • Vinithra Varadharajan at Jun 18, 2012 at 7:15 pm
    Ferdy,

    I believe this feature is available in CDH3u4, so you should be fine
    without upgrading.

    -Vinithra
  • Ferdy Galema at Jun 18, 2012 at 7:20 pm
    Pretty sure that is not the case. We ran 'hadoop namenode -recover'
    pointing at the right namenode directory. At each error (one per
    edits file) we had only two options: stop reading the edits file
    (abandoning all later edits) or quit altogether. It looks like we are
    missing the 'c' (continue) option... (Then again, that might be due
    to our specific case.)
  • Colin Patrick McCabe at Jun 18, 2012 at 8:05 pm
    Hi Ferdy,

    In CDH3 you only get two options: stop reading the edits file at the
    current offset, or quit. CDH4 also has the "continue" option. In
    general, CDH4 has a lot of great new stuff, and this is just one
    example. I believe this is mentioned somewhere in the blog post, but I
    don't blame you for asking for clarification.

    In your specific case, it might be interesting to compare the offset
    of the first corrupt edit it gives you with the end of the file. If
    your corruption happened because of an out-of-disk-space scenario, I
    would expect those two numbers to be pretty close together, so you
    shouldn't lose too many edits by snipping off the edit log at that
    point. It's best to check this before you do it, of course, but that
    seems pretty likely to me.
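
    A quick way to check is to compare the reported corruption offset
    against the length of the edits file. For illustration, a minimal
    standalone sketch (the class name and argument handling here are
    hypothetical; pass the edits file and the offset from the "Error
    replaying edit log at offset N" line):

    import java.io.File;

    public class EditLogOffsetCheck {
        public static void main(String[] args) {
            File edits = new File(args[0]);               // e.g. current/edits.new
            long corruptOffset = Long.parseLong(args[1]); // e.g. 40548
            long length = edits.length();
            System.out.println("file length:    " + length + " bytes");
            System.out.println("corrupt offset: " + corruptOffset + " bytes");
            System.out.println("bytes lost by truncating there: "
                    + (length - corruptOffset));
        }
    }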

    cheers,
    Colin
    Software Engineer, Cloudera

  • Colin Patrick McCabe at Jun 18, 2012 at 10:09 pm
    I re-read your post more carefully, and now I see that the corruption
    is happening rather early in the edit log file.

    Unfortunately, you can't upgrade to CDH4 and use recovery mode at the
    same time. There are actually good technical reasons for this... the
    edit log format changed between CDH3 and CDH4. The new format is a
    lot easier to work with from recovery's point of view, because it
    includes things like transaction IDs.

    Do you have multiple copies of the edit log and fsimage? It's highly
    discouraged to run with no backups, for exactly this reason. If you
    do, you might try moving the known-corrupt edit log out of the way and
    seeing if it starts after that.
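
    If you do have another copy, a minimal sketch of setting the corrupt
    log aside might look like this (the directory path is an example
    only, and keep a backup copy elsewhere rather than relying on the
    rename alone):

    import java.nio.file.*;

    public class SetAsideCorruptEdits {
        public static void main(String[] args) throws Exception {
            // Example path; use your own dfs.name.dir's current/ directory.
            Path current = Paths.get("/data/dfs/name/current");
            // Rename the corrupt log so the namenode won't try to replay it.
            Files.move(current.resolve("edits.new"),
                       current.resolve("edits.new.corrupt"));
        }
    }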

    I'm assuming that the secondary NN has nothing useful on it, since you
    say it has not been functioning for a month.

    When was the last fsimage written out?

    Colin
    Software Engineer, Cloudera

  • Ferdy Galema at Jun 19, 2012 at 7:40 am
    That's right, the corruption is early on in the file. This is what
    nn/current looks like:

    6965105 May 18 16:54 edits
    894554862 Jun 18 09:31 edits.new
    283050614 May 18 15:54 fsimage
    8 May 18 15:54 fstime
    114 May 18 15:54 VERSION

    The nn2 is exactly the same (we checked hashes), except that it does
    not have the edits.new file. Cutting the entry out of the log file
    just makes it fail on the next one. I am very curious what exactly
    causes the NPE. Looking at the source code, it seems to fail when
    adding a certain child node:

    Line 1099 of FSDirectory:

        T addedNode = ((INodeDirectory)pathComponents[pos-1]).addChild(
                child, inheritPermission);

    pathComponents is initialized (at line 1011, as 'inodes'), but one of
    its elements is null. (That is the only way an NPE can be thrown at
    this point.) Knowing the internals will definitely help with cutting
    away all the faulty entries. Will look into it some more.
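
    To make that failure mode concrete, here is a small hypothetical
    sketch (the resolve logic and all names are illustrative, not the
    real FSDirectory code): when a parent directory's inode was lost to
    an earlier corrupt or skipped edit, path resolution leaves a null in
    the components array, and the unchecked dereference then throws
    exactly this kind of NPE.

    import java.util.HashMap;
    import java.util.Map;

    public class NpeSketch {
        // Resolve each path component to its inode; a missing parent
        // simply leaves a null slot in the array.
        static Object[] resolve(String path, Map<String, Object> inodes) {
            String[] parts = path.substring(1).split("/");
            Object[] components = new Object[parts.length];
            StringBuilder cur = new StringBuilder();
            for (int i = 0; i < parts.length; i++) {
                cur.append('/').append(parts[i]);
                components[i] = inodes.get(cur.toString());
            }
            return components;
        }

        public static void main(String[] args) {
            Map<String, Object> inodes = new HashMap<String, Object>();
            inodes.put("/hbase", new Object());
            // /hbase/.logs was never created because its edit was lost:
            Object[] pathComponents = resolve("/hbase/.logs/somefile", inodes);
            int pos = pathComponents.length - 1;
            // Mirrors the failing line: parent lookup without a null check.
            pathComponents[pos - 1].toString();  // NullPointerException
        }
    }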

    Thanks for your help so far. It is greatly appreciated.

  • Ferdy Galema at Jun 19, 2012 at 2:14 pm
    I'm glad to report that it is fixed. I copied the namenode dir to a
    local machine and debugged it in a patched CDH3u4 environment. I added
    a try-catch-continue statement around FSEditLog:693 so that it prints
    the affected path and continues reading the edits file (a sketch
    follows the log below). Around 20,000 entries were skipped, all of
    them more-or-less old, disposable HBase logs. I guess I created my own
    'continue' option. Judging by the log, a lot of edits were saved:

    2012-06-19 13:14:46,706 INFO common.Storage - Edits file /home/ferdy/nn/current/edits of size 6965105 edits # 63903 loaded in 1 seconds
    ...
    2012-06-19 13:16:05,980 INFO namenode.FSNamesystem - Invalid opcode, reached end of edit log Number of transactions found: 6874928. Bytes read: 893828591
    2012-06-19 13:16:05,980 INFO common.Storage - Edits file /home/ferdy/nn/current/edits.new of size 894554862 edits # 6874928 loaded in 79 seconds.
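
    For reference, here is a rough sketch of the patch. Hedged: the loop
    structure, variable names and the unprotectedAddFile parameters are
    assumptions reconstructed from the stack trace, not the actual
    CDH3u4 source.

    // Inside FSEditLog.loadFSEdits, around the call at line 693 that
    // was throwing the NPE:
    try {
        fsDir.unprotectedAddFile(path, /* ...remaining decoded fields... */);
    } catch (NullPointerException npe) {
        // Skip the unreplayable entry instead of aborting the whole
        // replay, and log the path so the losses can be audited later.
        FSNamesystem.LOG.warn("Skipping corrupt edit for path " + path, npe);
        numEditsSkipped++;
        continue;  // move on to the next record in the opcode loop
    }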

    I copied the newly created image back to the cluster, removed some
    files with missing blocks (also mostly just old HBase logs), and
    everything seems to be working fine now.

    Thanks again for the responses.

Discussion Overview
group: cm-users
categories: hadoop
posted: Jun 18, 2012 at 2:36 pm
active: Jun 19, 2012 at 2:14 pm
posts: 9
users: 3
website: cloudera.com
irc: #hadoop
