Hi,

We have been running CDH 4.1 for a couple of months, with HA enabled for
quorum-based storage. Suddenly the cluster is in bad health.

Here is the initial log message:

"The reported blocks 3132015 has reached the threshold 0.9990 of total
blocks 3135151. Safe mode will be turned off automatically in 29 seconds."

After a couple of minutes, both NameNodes shut down.

Here is the relevant part of the log:

FATAL org.apache.hadoop.hdfs.server.namenode.FSEditLog: Error: flush failed
for required journal (JournalAndStream(mgr=QJM to [192.168.x.x:8485,
192.168.x.x:8485, 192.168.x.x:8485], stream=QuorumOutputStream starting at
txid 22877548))
org.apache.hadoop.hdfs.qjournal.client.QuorumException: Got too many
exceptions to achieve quorum size 2/3. 3 exceptions thrown:
192.168.x.x:8485: IPC's epoch 19 is less than the last promised epoch 20

SHUTDOWN_MSG: Shutting down NameNode at x.x.x.x

Can anyone help?

To unsubscribe from this group and stop receiving emails from it, send an email to scm-users+unsubscribe@cloudera.org.


  • Rex Zhen at Dec 4, 2013 at 10:07 am
    It is getting better after I increased the heap size on the NameNode
    from 1 GB (the default) to 4 GB.

    I still get the warning "The DataNode has 1,678,630 blocks. Warning threshold:
    200,000 block(s)<http://nn-01-sc.nim.com:7180/cmf/services/31/instances/126/advicePopup?timestamp=1386151485571&currentMode=true&healthTestName=DATA_NODE_BLOCK_COUNT>
    "

    and I can see that the block map is still being updated.

    Is that normal? Or can I increase the warning threshold in the config?
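[Editor's note: a commonly cited rule of thumb for NameNode sizing is roughly 1 GB of heap per million namespace objects (files + blocks). The sketch below shows that back-of-envelope arithmetic only; the 1 GB/million ratio is an assumption, and real requirements vary with the file/block ratio and JVM overhead.]

```python
def estimate_namenode_heap_gb(total_blocks, gb_per_million_objects=1.0):
    """Back-of-envelope NameNode heap estimate.

    Assumes the common rule of thumb of ~1 GB of heap per million
    namespace objects; actual needs depend on file/block ratios,
    replication, and garbage-collection overhead.
    """
    return (total_blocks / 1_000_000) * gb_per_million_objects

# The cluster above reported ~3.1M total blocks, so the 1 GB default
# heap was undersized; 4 GB leaves some headroom.
print(round(estimate_namenode_heap_gb(3_135_151), 1))  # → 3.1
```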
  • Mark Schnegelberger at Dec 4, 2013 at 9:34 pm
    Hi Rex,

    Clicking on that health check will give you additional detail around what
    this health check does, and why we check it: "This is a DataNode health
    check that checks for whether the DataNode has too many blocks. Having too
    many blocks on a DataNode may affect the DataNode's performance, and an
    increasing block count may require additional heap space to prevent long
    garbage collection pauses. This test can be configured using the *DataNode
    Block Count Thresholds* DataNode monitoring setting."

    In your case, you have at least one DataNode with 1.6M (!) blocks. Cloudera
    Manager notifies you of this so you can take action. As to why this node
    has such a high block count: perhaps you're writing a lot of very tiny
    files, which you could aggregate in some other fashion. While you *could*
    simply raise the threshold to silence this health alert, you may wish to
    dig deeper into why there are so many blocks.

    --
    Mark S.

  • Todd Grayson at Dec 4, 2013 at 11:51 pm
    I suggest reviewing the discussion here:

    http://blog.cloudera.com/blog/2009/02/the-small-files-problem/
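[Editor's note: the core of the small-files problem is that every file occupies at least one block entry in the NameNode's memory regardless of its size, so many tiny files inflate the block count far beyond what the raw data volume suggests. A minimal sketch of that arithmetic; the 128 MB block size and the file sizes are illustrative assumptions (older CDH releases often defaulted to 64 MB).]

```python
import math

def blocks_for_file(file_bytes, block_bytes=128 * 1024**2):
    """Number of HDFS block entries a single file occupies.

    Every file uses at least one block; larger files use
    ceil(size / block_size) blocks.
    """
    return max(1, math.ceil(file_bytes / block_bytes))

one_mb = 1024**2
# 1 TB stored as 1,000,000 files of 1 MB each:
tiny_files = 1_000_000 * blocks_for_file(one_mb)
# The same 1 TB stored as 8,192 files of 128 MB each:
large_files = 8_192 * blocks_for_file(128 * one_mb)
print(tiny_files, large_files)  # → 1000000 8192
```

The same amount of data costs over a hundred times more NameNode metadata when written as tiny files, which is why aggregating small files (e.g., into larger container files) is usually preferred over raising the alert threshold.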

    --
    Todd Grayson
    Cloudera Support
    Customer Operations Engineering
    tgrayson@cloudera.com


Discussion Overview

group: scm-users @ cloudera.com
category: hadoop
posted: Dec 4, 2013 at 9:13 AM
active: Dec 4, 2013 at 11:51 PM
posts: 4
users: 3
irc: #hadoop