NameNode crash - cannot start dfs - need help
The namenode on an otherwise very stable HDFS cluster crashed recently. The filesystem filled up on the name node, which I assume is what caused the crash. The problem has been fixed, but I cannot get the namenode to restart. I am using version CDH3b2 (hadoop-0.20.2+320).

The error is this:

2010-10-05 14:46:55,989 INFO org.apache.hadoop.hdfs.server.common.Storage: Edits file /mnt/name/current/edits of size 157037 edits # 969 loaded in 0 seconds.
2010-10-05 14:46:55,992 ERROR org.apache.hadoop.hdfs.server.namenode.NameNode: java.lang.NumberFormatException: For input string: "12862^@^@^@^@^@^@^@^@"
at java.lang.NumberFormatException.forInputString(NumberFormatException.java:48)
at java.lang.Long.parseLong(Long.java:419)
at java.lang.Long.parseLong(Long.java:468)
at org.apache.hadoop.hdfs.server.namenode.FSEditLog.readLong(FSEditLog.java:1355)
at org.apache.hadoop.hdfs.server.namenode.FSEditLog.loadFSEdits(FSEditLog.java:563)
at org.apache.hadoop.hdfs.server.namenode.FSImage.loadFSEdits(FSImage.java:1022)
...

This page (http://wiki.apache.org/hadoop/TroubleShooting) recommends editing the edits file with a hex editor, but does not explain where the record boundaries are. It describes a different exception, but the cause seemed similar: the edits file. I tried removing a line at a time, but the error continues, only with a smaller size and edits #:

2010-10-05 14:37:16,635 INFO org.apache.hadoop.hdfs.server.common.Storage: Edits file /mnt/name/current/edits of size 156663 edits # 966 loaded in 0 seconds.
2010-10-05 14:37:16,638 ERROR org.apache.hadoop.hdfs.server.namenode.NameNode: java.lang.NumberFormatException: For input string: "12862^@^@^@^@^@^@^@^@"
at java.lang.NumberFormatException.forInputString(NumberFormatException.java:48)
at java.lang.Long.parseLong(Long.java:419)
at java.lang.Long.parseLong(Long.java:468)
at org.apache.hadoop.hdfs.server.namenode.FSEditLog.readLong(FSEditLog.java:1355)
at org.apache.hadoop.hdfs.server.namenode.FSEditLog.loadFSEdits(FSEditLog.java:563)
at org.apache.hadoop.hdfs.server.namenode.FSImage.loadFSEdits(FSImage.java:1022)
...

I tried removing the edits file altogether, but that failed with: java.io.IOException: Edits file is not found

I tried with a zero length edits file, so it would at least have a file there, but that results in an NPE:

2010-10-05 14:52:34,775 INFO org.apache.hadoop.hdfs.server.common.Storage: Edits file /mnt/name/current/edits of size 0 edits # 0 loaded in 0 seconds.
2010-10-05 14:52:34,776 ERROR org.apache.hadoop.hdfs.server.namenode.NameNode: java.lang.NullPointerException
at org.apache.hadoop.hdfs.server.namenode.FSDirectory.addChild(FSDirectory.java:1081)
at org.apache.hadoop.hdfs.server.namenode.FSDirectory.addChild(FSDirectory.java:1093)
at org.apache.hadoop.hdfs.server.namenode.FSDirectory.addNode(FSDirectory.java:996)
at org.apache.hadoop.hdfs.server.namenode.FSDirectory.unprotectedAddFile(FSDirectory.java:199)


Most if not all of the files I noticed in the edits file are temporary files that will be deleted once this thing gets back up and running anyway. There is a closed ticket that might be related: https://issues.apache.org/jira/browse/HDFS-686 , but the version I'm using seems to already include the fix for HDFS-686 (according to http://archive.cloudera.com/cdh/3/hadoop-0.20.2+320/changes.html).

What do I have to do to get back up and running?

Thank you for your help,

Matthew

  • Todd Lipcon at Oct 5, 2010 at 3:43 pm
    Hi Matt,

    If you want to keep your recent edits, you'll have to place a 0xFF at the
    beginning of the first corrupt edit entry in the edit log. It's a bit tough
    to find these boundaries, but you can try applying this patch and rebuilding:

    https://issues.apache.org/jira/browse/hdfs-1378

    This will tell you the offset of the broken entry ("recent opcodes"), and
    you can put a 0xFF there to tie off the file before the corrupt entry.
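    As a rough sketch of that byte edit (the offset below is only a placeholder
    for whatever the patched loader reports, and this is untested, so work on a
    backup):

        import shutil

        EDITS = "/mnt/name/current/edits"   # path from the log messages in this thread
        BAD_OFFSET = 0x3D0C                 # placeholder: offset of the corrupt entry
        OP_INVALID = 0xFF

        shutil.copy2(EDITS, EDITS + ".bak")  # keep an untouched copy first
        with open(EDITS, "r+b") as f:
            f.seek(BAD_OFFSET)
            f.write(bytes([OP_INVALID]))     # the loader stops reading at OP_INVALID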

    -Todd

    --
    Todd Lipcon
    Software Engineer, Cloudera
  • Matthew LeMieux at Oct 5, 2010 at 4:59 pm
    Thank you Todd.

    It does indeed seem like a challenge to find a record boundary, but I went ahead and did it; here is how, in case others are interested in doing the same.

    It looks like that value (0xFF) is referenced as OP_INVALID in the source file: [hadoop-dist]/src/hdfs/org/apache/hadoop/hdfs/server/namenode/FSEditLog.java.

    Every record begins with an op code that describes the record. The op codes are in the range [0,14] (inclusive), except for OP_INVALID. Each record type (based on op code) appears to have a different format. Additionally, it seems that the code for each record type has several code paths to support different versions of HDFS.

    I looked in the error messages and found the line number of the exception within the switch statement in the code (in this case, line 563). That told me that I was looking for an op code of either 0x00 or 0x09. I noticed that this particular code path had a record type that looked like this (notation is [# bytes: name]):

    [1: op code][4: int length][2: file system path length][?: file system path text]

    All I had to do was find a filesystem path, and look 7 bytes before it started. If the op code was a 0x00 or 0x09, then this was a candidate record.

    It would have been easier to just search for something from the error message ("12862" in my case) to find candidate records, but that string was in almost every record. It would also have been easier to just search for instances of the op code, but in my case one of the op codes (0x00) appears too often in the data to make that useful. If your op code is 0x03, for example, you will probably have a much easier time of it than I did.
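    The search boils down to something like this sketch (the path prefix is
    only an example you would swap for a string you know appears in your edits
    file, and the 0x00/0x09 check matches my case; it follows the layout
    described above and is not a real edit-log parser):

        def candidate_offsets(edits_path, path_prefix=b"/user"):
            # Assumed layout: [1: op code][4: int length][2: path length][path text]
            data = open(edits_path, "rb").read()
            offsets = []
            start = 0
            while True:
                i = data.find(path_prefix, start)
                if i == -1:
                    break
                op_offset = i - 7   # op code sits 1 + 4 + 2 bytes before the path text
                if op_offset >= 0 and data[op_offset] in (0x00, 0x09):
                    offsets.append(op_offset)
                start = i + 1
            return offsets

        print(candidate_offsets("/mnt/name/current/edits"))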

    I was able to successfully and quickly find record boundaries and replace the op code with 0xFF. After a few records I was back to the NPE that I was getting with a zero-length edits file:

    2010-10-05 16:47:39,670 INFO org.apache.hadoop.hdfs.server.common.Storage: Edits file /mnt/name/current/edits of size 157037 edits # 959 loaded in 0 seconds.
    2010-10-05 16:47:39,671 ERROR org.apache.hadoop.hdfs.server.namenode.NameNode: java.lang.NullPointerException
    at org.apache.hadoop.hdfs.server.namenode.FSDirectory.addChild(FSDirectory.java:1081)
    at org.apache.hadoop.hdfs.server.namenode.FSDirectory.addChild(FSDirectory.java:1093)
    at org.apache.hadoop.hdfs.server.namenode.FSDirectory.addNode(FSDirectory.java:996)
    at org.apache.hadoop.hdfs.server.namenode.FSDirectory.unprotectedAddFile(FSDirectory.java:199)
    at org.apache.hadoop.hdfs.server.namenode.FSEditLog.loadFSEdits(FSEditLog.java:627)
    at org.apache.hadoop.hdfs.server.namenode.FSImage.loadFSEdits(FSImage.java:1022)
    at org.apache.hadoop.hdfs.server.namenode.FSImage.loadFSImage(FSImage.java:830)
    at org.apache.hadoop.hdfs.server.namenode.FSImage.recoverTransitionRead(FSImage.java:378)
    at org.apache.hadoop.hdfs.server.namenode.FSDirectory.loadFSImage(FSDirectory.java:92)

    One hurdle down, how do I get past the next one?

    (BTW, what if I didn't want to keep my recent edits, and just wanted to start up the namenode? This is currently expensive downtime; I'd rather lose a small amount of data and be up and running than continue the downtime.)

    Thank you for your help,

    Matthew
  • Todd Lipcon at Oct 5, 2010 at 5:10 pm

    On Tue, Oct 5, 2010 at 9:58 AM, Matthew LeMieux wrote:

    > One hurdle down, how do I get past the next one?

    It's unclear whether you're getting the error in "edits" or "edits.new".
    From the above, I'm guessing "edits" is corrupt, so when you fixed the error
    there (by truncating a few edits from the end), the later edits in edits.new
    failed because they depended on a path that should have been created by
    "edits".

    > (BTW, what if I didn't want to keep my recent edits, and just wanted to
    > start up the namenode? This is currently expensive downtime; I'd rather
    > lose a small amount of data and be up and running than continue the
    > downtime.)

    If you really want to do this, you can remove "edits.new" and replace
    "edits" with a file containing hex 0xffffffeeff, I believe (the edits header
    plus OP_INVALID).
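    In script form that amounts to roughly the following sketch (paths are the
    ones from this thread, the five bytes are the 0xffffffeeff above, and it is
    untested, so set the originals aside rather than deleting them):

        import os, shutil

        current = "/mnt/name/current"

        if os.path.exists(os.path.join(current, "edits.new")):
            shutil.move(os.path.join(current, "edits.new"),
                        os.path.join(current, "edits.new.corrupt"))
        shutil.move(os.path.join(current, "edits"),
                    os.path.join(current, "edits.corrupt"))

        with open(os.path.join(current, "edits"), "wb") as f:
            f.write(bytes.fromhex("ffffffeeff"))  # edits header + 0xFF (OP_INVALID)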

    -Todd

    --
    Todd Lipcon
    Software Engineer, Cloudera
  • Matthew LeMieux at Oct 5, 2010 at 6:26 pm
    Thank you Ayon, Allen and Todd for your suggestions.

    I was tempted to try to find the offending records in edits.new, but opted for simply moving the file instead. I kept the recently edited edits file in place.

    The namenode started up this time with no exceptions and appears to be running well; hadoop fsck / reports a healthy filesystem.

    Thank you,

    Matthew
  • Ayon Sinha at Oct 5, 2010 at 6:33 pm
    Hi Matthew,
    Congratulations. Having HDFS back is quite a relief, and you were lucky
    enough not to lose any files/blocks.
    Another thing I ended up doing was to decommission the namenode machine from
    being a data node. That is what had caused the namenode to run out of disk
    space.
    -Ayon
  • Ayon Sinha at Oct 5, 2010 at 5:18 pm
    Hi Matthew,
    "(BTW, what if I didn't want to keep my recent edits, and just wanted to start
    up the namenode? This is currently expensive downtime; I'd rather lose a small
    amount of data and be up and running than continue the down time). "
    This was exactly my use case as well. I chose small data loss over spending
    hours on end trying to get past the exceptions.
    Try this: rename the 4 files under /mnt/name/current to something like
    *.corrupt, then copy over the 4 files from /mnt/namesecondarynode/current.
    Make sure you have enough space on the namenode box.
    Then try starting the namenode. It worked for me; I was at the same place as
    you only a week ago.
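    As a script, those steps look roughly like this (a sketch using the paths
    from this thread; stop the namenode first, and check the exact file list on
    your own cluster before trusting it):

        import os, shutil

        name_cur = "/mnt/name/current"
        snn_cur = "/mnt/namesecondarynode/current"

        # Set the corrupt files aside rather than deleting them.
        for fname in os.listdir(name_cur):
            shutil.move(os.path.join(name_cur, fname),
                        os.path.join(name_cur, fname + ".corrupt"))

        # Bring over the secondary namenode's last checkpoint files.
        for fname in os.listdir(snn_cur):
            shutil.copy2(os.path.join(snn_cur, fname),
                         os.path.join(name_cur, fname))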
    -Ayon
  • Allen Wittenauer at Oct 5, 2010 at 4:10 pm

    No 2nd copy of the edits file/another entry in dfs.name.dir?
  • Matthew LeMieux at Oct 5, 2010 at 5:14 pm
    No second copy. There will be from now on, but that doesn't get me out of the current hole.

    I'm not too concerned with recent edits. I'd be much happier losing recent edits if I could just get the thing to start!

    So, the question is, how do I tell the name node to just start up no matter what?

    -Matthew
  • Ayon Sinha at Oct 5, 2010 at 5:19 pm
    Have you tried getting rid of the edits.new file completely (by renaming it to
    something else)?
    -Ayon
  • Ayon Sinha at Oct 5, 2010 at 4:20 pm
    We had almost exactly the same problem of the namenode filling up and
    failing at this exact same point. Since you have created space now, you can
    copy over the edits.new, fsimage and the other 2 files from your
    /mnt/namesecondarynode/current and try restarting the namenode.
    I believe you will lose some edits and probably some blocks of some files,
    but we could recover most of our files.
    -Ayon