secondary namenode errors
Hi folks,



Would be grateful if someone can help us understand why our secondary
namenodes don't seem to be doing anything:



1. running 0.13.0
2. secondary namenode logs continuously spew:



at org.apache.hadoop.ipc.Client.call(Client.java:471)
at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:165)
at org.apache.hadoop.dfs.$Proxy0.rollEditLog(Unknown Source)
at org.apache.hadoop.dfs.SecondaryNameNode.doCheckpoint(SecondaryNameNode.java:252)
at org.apache.hadoop.dfs.SecondaryNameNode.run(SecondaryNameNode.java:181)
at java.lang.Thread.run(Thread.java:595)

org.apache.hadoop.ipc.RemoteException: java.io.IOException: Attempt to roll edit log but edits.new exists
at org.apache.hadoop.dfs.FSEditLog.rollEditLog(FSEditLog.java:467)
at org.apache.hadoop.dfs.FSNamesystem.rollEditLog(FSNamesystem.java:3239)
at org.apache.hadoop.dfs.NameNode.rollEditLog(NameNode.java:544)
at sun.reflect.GeneratedMethodAccessor18.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:585)
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:341)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:573)



3. Primary namenode spews similar errors:

java.io.IOException: Attempt to roll edit log but edits.new exists
at org.apache.hadoop.dfs.FSEditLog.rollEditLog(FSEditLog.java:467)
at org.apache.hadoop.dfs.FSNamesystem.rollEditLog(FSNamesystem.java:3239)
at org.apache.hadoop.dfs.NameNode.rollEditLog(NameNode.java:544)
at sun.reflect.GeneratedMethodAccessor18.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:585)
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:341)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:573)



4. No current, fsimage etc. in secondary namenode:



//primary:
[jssarma@hadoop001 hadoop]$ pwd
/var/hadoop
[jssarma@hadoop001 hadoop]$ ls
current image in_use.lock logs tmp
[jssarma@hadoop001 hadoop]$ ls current
edits edits.new fsimage fstime VERSION

//secondary:
[root@hadoop072 hadoop]# pwd
/var/hadoop
[root@hadoop072 hadoop]# ls
logs tmp
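
As a quick cross-check, checkpoint progress can be watched by comparing timestamps on the two machines. The paths are the ones from the listings above; adjust them for your own layout:

//primary: fsimage should get a fresh timestamp after every successful
//checkpoint, and edits.new should disappear once the roll completes
ls -l /var/hadoop/current/fsimage /var/hadoop/current/edits*

//secondary: a working secondary would eventually create its own checkpoint
//files here rather than just logs and tmp
ls -l /var/hadoop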



help appreciated,



Joydeep


  • Raghu Angadi at Aug 24, 2007 at 4:45 am
    On a related note, please don't use 0.13.0; use the latest released
    version for 0.13 (I think it is 0.13.1). If the secondary namenode
    actually works, then it will result in all the replications being set to 1.

    Raghu.

    Joydeep Sen Sarma wrote:
    Hi folks,









  • Joydeep Sen Sarma at Aug 24, 2007 at 7:57 am
    Thanks for replying.

    Can you please clarify - is it the case that the secondary namenode
    stuff only works in 0.13.1? and what's the connection with replication
    factor?

    We lost the file system completely once, trying to make sure we can
    avoid it the next time.

    Joydeep

    -----Original Message-----
    From: Raghu Angadi
    Sent: Thursday, August 23, 2007 9:44 PM
    To: hadoop-user@lucene.apache.org
    Subject: Re: secondary namenode errors


    On a related note, please don't use 0.13.0; use the latest released
    version for 0.13 (I think it is 0.13.1). If the secondary namenode
    actually works, then it will result in all the replications being set to 1.

    Raghu.

    Joydeep Sen Sarma wrote:
    Hi folks,









  • Raghu Angadi at Aug 24, 2007 at 3:16 pm

    Joydeep Sen Sarma wrote:
    Thanks for replying.

    Can you please clarify - is it the case that the secondary namenode
    stuff only works in 0.13.1? and what's the connection with replication
    factor?

    We lost the file system completely once, trying to make sure we can
    avoid it the next time.
    I am not sure if the problem you reported still exists in 0.13.1. You
    might still have the problem, in which case you can ask again. But you
    should move to 0.13.1 since it has some critical fixes; see the release
    notes for 0.13.1 or HADOOP-1603. You should always upgrade to the latest
    minor release when moving to the next major version.

    Raghu.
  • Koji Noguchi at Aug 24, 2007 at 3:29 pm
    Joydeep,

    I think you're hitting this bug.
    http://issues.apache.org/jira/browse/HADOOP-1076

    In any case, as Raghu suggested, please use 0.13.1 and not 0.13.

    Koji




  • Joydeep Sen Sarma at Aug 25, 2007 at 2:19 am
    I wish I had read the bug more carefully - I thought the issue was
    fixed in 0.13.1.

    Of course not, the issue persists. Meanwhile, half the files are
    corrupted after the upgrade (we followed the upgrade wiki and tried to
    restore the backed-up metadata and the old version - to no avail).

    Sigh - have a nice weekend everyone,

    Joydeep

    -----Original Message-----
    From: Koji Noguchi
    Sent: Friday, August 24, 2007 8:29 AM
    To: hadoop-user@lucene.apache.org
    Subject: Re: secondary namenode errors

    Joydeep,

    I think you're hitting this bug.
    http://issues.apache.org/jira/browse/HADOOP-1076

    In any case, as Raghu suggested, please use 0.13.1 and not 0.13.

    Koji




  • Ted Dunning at Aug 25, 2007 at 2:43 am
    Damn.

    I hope you have a nice weekend (anyway).

    On 8/24/07 7:20 PM, "Joydeep Sen Sarma" wrote:

    Sigh - have a nice weekend everyone,
  • Joydeep Sen Sarma at Aug 25, 2007 at 6:23 am
    Just in case someone's curious.



    Stop and restart dfs with 0.13.1:



    - master name node says:



    2007-08-24 18:31:27,318 INFO org.apache.hadoop.dfs.NameNode: Namenode up at: hadoop001.sf2p.facebook.com/10.16.159.101:9000
    2007-08-24 18:31:28,560 WARN org.apache.hadoop.dfs.StateChange: DIR* FSDirectory.unprotectedDelete: failed to remove /tmp/pu3 because it does not exist
    2007-08-24 18:31:28,571 WARN org.apache.hadoop.dfs.StateChange: DIR* FSDirectory.unprotectedRenameTo: failed to rename /user/facebook/chatter/rawcounts/2007-08-04/_task_0001_r_000044_0/part-00044 to /user/facebook/chatter/rawcounts/2007-08-04/part-00044 because destination exists
    2007-08-24 18:31:28,571 WARN org.apache.hadoop.dfs.StateChange: DIR* FSDirectory.unprotectedRenameTo: failed to rename /user/facebook/chatter/rawcounts/2007-08-04/_task_0001_r_000044_0/.part-00044.crc to /user/facebook/chatter/rawcounts/2007-08-04/.part-00044.crc because destination exists
    2007-08-24 18:31:28,572 WARN org.apache.hadoop.dfs.StateChange: DIR* FSDirectory.unprotectedRenameTo: failed to rename /user/facebook/chatter/rawcounts/2007-08-04/_task_0001_r_000040_0/part-00040 to /user/facebook/chatter/rawcounts/2007-08-04/part-00040 because destination exists
    2007-08-24 18:31:28,572 WARN org.apache.hadoop.dfs.StateChange: DIR* FSDirectory.unprotectedRenameTo: failed to rename /user/facebook/chatter/rawcounts/2007-08-04/_task_0001_r_000040_0/.part-00040.crc to /user/facebook/chatter/rawcounts/2007-08-04/.part-00040.crc because destination exists
    2007-08-24 18:31:28,573 WARN org.apache.hadoop.dfs.StateChange: DIR* FSDirectory.unprotectedRenameTo: failed to rename /user/facebook/chatter/rawcounts/2007-08-04/_task_0001_r_000052_0/part-00052 to /user/facebook/chatter/rawcounts/2007-08-04/part-00052 because destination exists

    ...



    There's a serious blast of these (replaying the edit log?). In any case,
    after this is done it enters safemode - I presume the fs is corrupted by
    then. At the exact same time, the datanodes are busy deleting blocks!:



    2007-08-24 18:31:33,243 INFO org.apache.hadoop.dfs.DataNode: Starting DataNode in: FSDataset{dirpath='/var/hadoop/tmp/dfs/data/current'}
    2007-08-24 18:31:33,243 INFO org.apache.hadoop.dfs.DataNode: using BLOCKREPORT_INTERVAL of 3588023msec
    2007-08-24 18:31:34,252 INFO org.apache.hadoop.dfs.DataNode: Deleting block blk_-9223045762536565560 file /var/hadoop/tmp/dfs/data/current/subdir14/subdir18/blk_-9223045762536565560
    2007-08-24 18:31:34,269 INFO org.apache.hadoop.dfs.DataNode: Deleting block blk_-9214178286744587840 file /var/hadoop/tmp/dfs/data/current/subdir14/subdir12/blk_-9214178286744587840
    2007-08-24 18:31:34,370 INFO org.apache.hadoop.dfs.DataNode: Deleting block blk_-9213127144044535407 file /var/hadoop/tmp/dfs/data/current/subdir14/subdir20/blk_-9213127144044535407
    2007-08-24 18:31:34,386 INFO org.apache.hadoop.dfs.DataNode: Deleting block blk_-9211625398030978419 file /var/hadoop/tmp/dfs/data/current/subdir14/subdir26/blk_-9211625398030978419
    2007-08-24 18:31:34,418 INFO org.apache.hadoop.dfs.DataNode: Deleting block blk_-9189558923884323865 file /var/hadoop/tmp/dfs/data/current/subdir14/subdir24/blk_-9189558923884323865
    2007-08-24 18:31:34,419 INFO org.apache.hadoop.dfs.DataNode: Deleting block blk_-9115468136273900585 file /var/hadoop/tmp/dfs/data/current/subdir10/blk_-9115468136273900585





    ouch - I guess those are all the blocks that fsck is now reporting
    missing. Known bug? Operator error? (well - I did do a clean shutdown
    ..)
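
    In case it helps anyone else who hits this: the missing blocks can be
    enumerated with fsck, and the safemode state checked from the command line.
    These are the standard commands for releases of this vintage (run from the
    Hadoop install directory); the exact output format may vary:

    bin/hadoop fsck / -files -blocks -locations | grep -i missing
    bin/hadoop dfsadmin -safemode get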





  • Konstantin Shvachko at Aug 28, 2007 at 8:39 pm
    Could you please describe what exactly the problem with the upgrade is?
    If a malfunctioning secondary name-node messes up the image and/or edits
    files, then we should fix the problem asap.

    Thanks,
    Konstantin

    Joydeep Sen Sarma wrote:
    Just in case someone's curious.






  • Joydeep Sen Sarma at Aug 28, 2007 at 10:08 pm
    I don't think the secondary namenodes have been working at any point - so
    I'm not sure they are a factor.

    What I observed:

    - stopped dfs and took a backup copy of the current/ directory
    - restarted dfs with the new 0.13.1
    - after the file system came back up, fsck said the fs was corrupt; a
    large number of files have blocks missing
    - tried to stop, copy back the backed-up copy of current/ and restart with
    0.13.0 - no effect, the same corruption remains.
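
    Roughly the same sequence as shell commands - the paths come from the
    listings earlier in the thread and the scripts are the standard ones under
    the Hadoop bin/ directory, so treat this as a sketch rather than the exact
    commands that were run:

    bin/stop-dfs.sh                                    # stop dfs cleanly
    cp -a /var/hadoop/current /var/hadoop/current.bak  # back up namenode metadata
    # ... switch the installation over to 0.13.1 ...
    bin/start-dfs.sh                                   # bring dfs back up
    bin/hadoop fsck /                                  # check file system health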

    The logs attached before show what happened as soon as 0.13.1 was
    brought up:

    - The primary namenode seems to have gone through the edit log (I am
    guessing). It spews out a large number of errors saying 'cannot rename xxx
    to yyy, yyy already exists'.
    - All the datanodes show a large number of 'deleting block bbb' errors.

    Without knowing the source, the logs would indicate that the primary
    went through the edit log and somehow managed to instruct the datanodes
    to delete blocks, which caused the corruption.

    Would be happy to send more logs your way (although the pattern remains
    the same) if it helps,

    Thanks,

    Joydeep




    -----Original Message-----
    From: Konstantin Shvachko
    Sent: Tuesday, August 28, 2007 1:39 PM
    To: hadoop-user@lucene.apache.org
    Subject: Re: secondary namenode errors

    Could you please describe what exactly the problem with the upgrade is?
    If a malfunctioning secondary name-node messes up the image and/or edits
    files, then we should fix the problem asap.

    Thanks,
    Konstantin

    Joydeep Sen Sarma wrote:
    Just in case someone's curious.






  • ChaoChun Liang at Sep 4, 2007 at 8:12 am

    Koji Noguchi wrote:

    Joydeep,

    I think you're hitting this bug.
    http://issues.apache.org/jira/browse/HADOOP-1076

    In any case, as Raghu suggested, please use 0.13.1 and not 0.13.

    Koji
    Hi,
    I met a similar problem under 0.14.0 when I increased the slave nodes
    from 2 to 3.
    All tasks work fine in the 2-node cluster, but they hang in the reduce
    phase in the 3-node cluster.
    Each of my *.secondarynamenode.* logs shows the following.

    2007-09-04 16:02:33,390 ERROR org.apache.hadoop.dfs.NameNode.Secondary: Exception in doCheckpoint:
    2007-09-04 16:02:33,390 ERROR org.apache.hadoop.dfs.NameNode.Secondary: org.apache.hadoop.ipc.RemoteException: java.io.IOException: Attempt to roll edit log but edits.new exists
    at org.apache.hadoop.dfs.FSEditLog.rollEditLog(FSEditLog.java:582)
    at org.apache.hadoop.dfs.FSNamesystem.rollEditLog(FSNamesystem.java:3510)
    at org.apache.hadoop.dfs.NameNode.rollEditLog(NameNode.java:553)
    at sun.reflect.GeneratedMethodAccessor14.invoke(Unknown Source)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
    at java.lang.reflect.Method.invoke(Method.java:597)
    at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:340)
    at org.apache.hadoop.ipc.Server$Handler.run(Server.java:566)

    at org.apache.hadoop.ipc.Client.call(Client.java:470)
    at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:165)
    at org.apache.hadoop.dfs.$Proxy0.rollEditLog(Unknown Source)
    at org.apache.hadoop.dfs.SecondaryNameNode.doCheckpoint(SecondaryNameNode.java:261)
    at org.apache.hadoop.dfs.SecondaryNameNode.run(SecondaryNameNode.java:190)
    at java.lang.Thread.run(Thread.java:619)

    Any suggestions on how to solve this?

    ChaoChun
    --
    View this message in context: http://www.nabble.com/secondary-namenode-errors-tf4321013.html#a12473770
    Sent from the Hadoop Users mailing list archive at Nabble.com.
  • Dhruba Borthakur at Sep 4, 2007 at 4:42 pm
    I have already uploaded a patch for HADOOP-1076. If somebody can review this
    patch, it can be committed.
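
    For anyone who wants to try the patch before it lands in a release, the
    usual routine is roughly the following (the patch file name is illustrative;
    download the attachment from the HADOOP-1076 JIRA page, and the build target
    may differ by release):

    cd hadoop-0.14.0               # a source checkout/tarball of the matching release
    patch -p0 < HADOOP-1076.patch  # apply the patch from the project root
    ant                            # rebuild, then redeploy the rebuilt jars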

    Thanks,
    dhruba

    -----Original Message-----
    From: ChaoChun Liang
    Sent: Tuesday, September 04, 2007 1:12 AM
    To: hadoop-user@lucene.apache.org
    Subject: Re: secondary namenode errors



  • ChaoChun Liang at Sep 5, 2007 at 5:47 am
    This patch solved the problems with the secondary name node and also
    provided clearer information for debugging the hung-reduce problem in my
    environment. The hang was caused by a firewall blocking the connection,
    and all tasks are back to normal now.
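
    For anyone else debugging a similar hang: a quick way to spot this kind of
    firewall blocking is to test the relevant ports from each slave node. The
    host and port below are only examples (9000 is the namenode port seen in the
    logs earlier in this thread); substitute your own namenode, jobtracker and
    datanode addresses:

    nc -z -w 5 namenode-host 9000 && echo reachable || echo blocked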

    Thanks.
    ChaoChun


    Dhruba Borthakur wrote:
    I have already uploaded a patch for HADOOP-1076. If somebody can review
    this patch, it can be committed.

    Thanks,
    dhruba



    --
    View this message in context: http://www.nabble.com/secondary-namenode-errors-tf4321013.html#a12492019
    Sent from the Hadoop Users mailing list archive at Nabble.com.
