Today we observed a NameNode crash caused by journaling exceptions. The
failover mechanism held up: the standby NameNode automatically took over.
My question is why the active NameNode should exit on a journaling error...I
thought the active NameNode does its journaling asynchronously and shouldn't
be affected by journaling exceptions.

If anyone has hit similar issues before, please share how to prevent this
from happening...we'll upgrade to CDH4.2 shortly; will 4.2 fix this issue?

Thanks
Ken

2013-05-17 06:26:13,155 FATAL
org.apache.hadoop.hdfs.server.namenode.FSEditLog: Error: starting log
segment 12236969 failed for required journal (JournalAndStream(mgr=QJM to
[master:8485, slave1:8485, slave2:8485], stream=null))
java.io.IOException: Timed out waiting 20000ms for a quorum of nodes to
respond.
         at org.apache.hadoop.hdfs.qjournal.client.AsyncLoggerSet.waitForWriteQuorum(AsyncLoggerSet.java:137)
         at org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager.startLogSegment(QuorumJournalManager.java:387)
         at org.apache.hadoop.hdfs.server.namenode.JournalSet$JournalAndStream.startLogSegment(JournalSet.java:91)
         at org.apache.hadoop.hdfs.server.namenode.JournalSet$2.apply(JournalSet.java:199)
         at org.apache.hadoop.hdfs.server.namenode.JournalSet.mapJournalsAndReportErrors(JournalSet.java:350)
         at org.apache.hadoop.hdfs.server.namenode.JournalSet.startLogSegment(JournalSet.java:196)
         at org.apache.hadoop.hdfs.server.namenode.FSEditLog.startLogSegment(FSEditLog.java:918)
         at org.apache.hadoop.hdfs.server.namenode.FSEditLog.rollEditLog(FSEditLog.java:887)
         at org.apache.hadoop.hdfs.server.namenode.FSImage.rollEditLog(FSImage.java:1013)
         at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.rollEditLog(FSNamesystem.java:4436)
         at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.rollEditLog(NameNodeRpcServer.java:734)
         at org.apache.hadoop.hdfs.protocolPB.NamenodeProtocolServerSideTranslatorPB.rollEditLog(NamenodeProtocolServerSideTranslatorPB.java:129)
         at org.apache.hadoop.hdfs.protocol.proto.NamenodeProtocolProtos$NamenodeProtocolService$2.callBlockingMethod(NamenodeProtocolProtos.java:8762)
         at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:453)
         at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:898)
         at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1693)
         at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1689)
         at java.security.AccessController.doPrivileged(Native Method)
         at javax.security.auth.Subject.doAs(Subject.java:396)
         at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1332)
         at org.apache.hadoop.ipc.Server$Handler.run(Server.java:1687)
2013-05-17 06:26:13,161 INFO org.apache.hadoop.util.ExitUtil: Exiting with
status 1
2013-05-17 06:26:13,166 INFO
org.apache.hadoop.hdfs.server.namenode.NameNode: SHUTDOWN_MSG:
/************************************************************
SHUTDOWN_MSG: Shutting down NameNode
************************************************************/

  • Aaron T. Myers at May 17, 2013 at 6:38 pm
    Hi Ken,

    The NN will not return success to any client before it has durably logged
    its edit, and in the event an edit cannot be durably logged the NN will
    indeed shut itself down. In this case for some reason the NN could not
    write some edit to a majority of JNs within the timeout period, and so it
    correctly shut itself down.
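    To illustrate the semantics Aaron describes, here is a minimal sketch (not
    the actual HDFS source) of how a quorum journal write behaves: the writer
    blocks until a majority of journal nodes acknowledge the edit, and if no
    majority responds within the timeout (20000ms in the log above), the
    failure is fatal. All names below are hypothetical.

    ```python
    # Sketch of quorum-acknowledged journaling, assuming each writer is a
    # callable that durably persists the edit and returns True on success.
    import concurrent.futures

    def write_to_quorum(journal_writers, edit, timeout_s=20.0):
        """Return True once a majority of writers ack; raise IOError otherwise."""
        needed = len(journal_writers) // 2 + 1  # majority quorum
        acks = 0
        with concurrent.futures.ThreadPoolExecutor(
                max_workers=len(journal_writers)) as pool:
            futures = [pool.submit(w, edit) for w in journal_writers]
            try:
                for fut in concurrent.futures.as_completed(futures,
                                                           timeout=timeout_s):
                    if fut.result():
                        acks += 1
                    if acks >= needed:
                        return True  # edit is durable on a majority of nodes
            except concurrent.futures.TimeoutError:
                pass  # some writers never responded in time
        # No quorum: the real NameNode treats this as fatal and exits.
        raise IOError(
            f"Timed out waiting for a quorum ({acks}/{needed} acks)")
    ```

    The key point is that the wait for a quorum is synchronous from the
    client's perspective even though the individual journal writes are issued
    in parallel, which is why a quorum timeout cannot simply be ignored.
    
    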

    I recommend you look in the JN logs, or perhaps higher up in the NN logs,
    to see if you can find some explanation for why the JNs could not log their
    edits promptly.
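    If the JNs turn out to be healthy but occasionally slow (GC pauses, disk
    contention), the QJM timeouts can be raised in hdfs-site.xml. A hedged
    example; these are the QJM timeout properties as of Hadoop 2.x-era HDFS
    (the 20000ms in the exception matches their defaults), but verify the
    names and defaults against your CDH version before relying on them:

    ```xml
    <!-- Timeout for starting a new log segment (the call that failed above). -->
    <property>
      <name>dfs.qjournal.start-segment.timeout.ms</name>
      <value>60000</value>
    </property>
    <!-- Timeout for writing a batch of edits to a quorum of JournalNodes. -->
    <property>
      <name>dfs.qjournal.write-txns.timeout.ms</name>
      <value>60000</value>
    </property>
    ```

    Raising timeouts only masks slow JNs, so it is worth finding the root
    cause in the JN logs first, as suggested above.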


    --
    Aaron T. Myers
    Software Engineer, Cloudera

    On Fri, May 17, 2013 at 7:59 AM, ken deng wrote:


Discussion Overview
group: cdh-user
categories: hadoop
posted: May 17, '13 at 2:59p
active: May 17, '13 at 6:38p
posts: 2
users: 2
website: cloudera.com
irc: #hadoop

2 users in discussion
Ken Deng: 1 post · Aaron T. Myers: 1 post
