FAQ
We regularly get warnings like these on our cloudera manager cluster
(hbase):

The health test result for REGION_SERVER_GC_DURATION has become bad:
Average time spent in garbage collection was 41,387 ms per minute over
the previous 5 minute(s). Critical threshold: 60 ms.

That in itself is not difficult to understand, although I think 60 ms
of garbage collection per minute should not trigger any alarms; that
is only 0.1 percent!

However, once I look at the configuration of these thresholds, I am
totally confused. The "HBase Master Garbage Collection Duration
Thresholds" configuration is:

warning: 30%
critical: 60%

With the explanation that this is "a percentage of elapsed wall clock time".

I agree that spending 30% to 60% of a minute in garbage collection
would indeed be concerning. But that doesn't seem to be what is
triggering the warnings above.

Am I misunderstanding something in this configuration? Or is the value
misinterpreted as milliseconds instead of a percentage by cloudera manager?

Thanks in advance,
Jan


  • Chris Leroy at Aug 6, 2013 at 3:49 pm
    Jan,

    There is indeed an ugly bug in the output of this check. It ought to
    read that it's spending on average 41.3 seconds per minute in GC
    (68.83%) over the last 5 minutes, and that the critical threshold for
    alerting is 60% of time.

    I'll file a bug to fix this up.

    chris
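The corrected reading boils down to converting the per-minute GC time into a
fraction of elapsed wall clock time. A minimal sketch of that conversion
(class and method names are illustrative, not Cloudera Manager's actual code;
the figures come from the alert text above):

```java
public class GcPercent {

    // Convert an "ms of GC per minute" reading into a percentage of
    // elapsed wall clock time (60,000 ms per minute).
    static double gcPercentOfWallClock(double gcMillisPerMinute) {
        return 100.0 * gcMillisPerMinute / 60_000.0;
    }

    public static void main(String[] args) {
        // The alert's reading: 41,387 ms per minute is ~69% of wall clock,
        // well over the 60% critical threshold.
        System.out.printf("41,387 ms/min = %.2f%%%n", gcPercentOfWallClock(41_387));
        // Jan's point: a literal 60 ms per minute would be only 0.1%.
        System.out.printf("60 ms/min = %.2f%%%n", gcPercentOfWallClock(60));
    }
}
```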
  • Jan Van Besien at Aug 6, 2013 at 6:54 pm

    On Tue, Aug 6, 2013 at 5:49 PM, Chris Leroy wrote:
    There is indeed an ugly bug in the output of this check. It ought to
    read that it's spending on average 41.3 seconds per minute in GC
    (68.83%) over the last 5 minutes and that the critical threshold for
    alerting is 60% of time.

    Thanks. I'll enable GC logging tomorrow because it sounds like we do
    have a serious GC/memory issue here then ;-)

    Jan
  • Jan Van Besien at Aug 7, 2013 at 9:08 am
    Hi Chris,
    On Tue, Aug 6, 2013 at 8:54 PM, Jan Van Besien wrote:
    Thanks. I'll enable gc logging tomorrow because it sounds like we do
    have a serious GC/memory issue here then ;-)
    I investigated the gc behavior in more detail today, using
    "-verbose:gc -XX:+PrintGCDetails."

    I now think that there is actually something wrong with the check
    itself as well, not only with the output.

    In attachment is an example verbosegc log file, for which cloudera
    manager generated GC duration alarms.

    A crude analysis of the file indicates that 137 seconds were spent in
    GC over a total duration of 591 seconds. That is about 23%, which is
    indeed very high.

    However, a more detailed analysis shows that the vast majority of
    individual GC pauses actually last less than 15 ms. Also, there are
    "concurrent mark and sweep" events roughly every 4 seconds, with an
    average duration of 0.002 seconds, and "parallel scavenge" events
    roughly every 0.03 seconds with an average duration of 0.007 seconds.
    That all sounds very acceptable to me.

    You should know that the nodes running our region servers have 32 cpu
    cores (or 16 with hyperthreading). So what I think is wrong with the
    check is that it compares cpu time with wall clock time, without
    taking into account the total available cpu time.

    I believe the check should actually compute 137/(591*32), which is
    only 0.7% of the time spent in GC.

    Does this make sense?

    Thanks,
    Jan
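The arithmetic Jan proposes can be sketched in a few lines (an illustration
only; whether dividing by core count is appropriate depends on whether the
reported GC time is CPU time or wall clock time, which is the open question
in this thread):

```java
public class GcLogSummary {

    // Fraction of elapsed wall clock time spent in GC.
    static double pctOfWallClock(double gcSeconds, double wallSeconds) {
        return 100.0 * gcSeconds / wallSeconds;
    }

    // Jan's proposed normalization: treat the reported GC time as CPU
    // time and divide by the total CPU time available across all cores.
    static double pctOfTotalCpu(double gcSeconds, double wallSeconds, int cores) {
        return 100.0 * gcSeconds / (wallSeconds * cores);
    }

    public static void main(String[] args) {
        // Figures from the log analysis above: 137 s of GC in 591 s, 32 cores.
        System.out.printf("vs wall clock: %.1f%%%n", pctOfWallClock(137, 591));
        System.out.printf("vs total cpu:  %.2f%%%n", pctOfTotalCpu(137, 591, 32));
    }
}
```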
  • Chris Leroy at Aug 8, 2013 at 5:53 pm
    You are correct. The check is based on the hadoop JVM metric
    gcTimeMillis, which itself just passes through
    GarbageCollectorMXBean's getCollectionTime(). You're right that this
    leaves it to the collector implementation to decide whether we're
    talking wall clock or CPU time, and as such it is close to useless
    for some collectors. We've got an internal bug to improve things
    here. I'll bump the priority.

    In the meantime -- do you know how to turn off the check that's
    causing your problems? It's available via a monitoring configuration
    parameter: [service] -> configuration -> view and edit -> monitoring:
    filter on "gc duration" and you can set both the warning and critical
    thresholds to "never" to disable the check.

    chris
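The value Chris describes can be read directly from any running JVM. A
minimal sketch of what the gcTimeMillis metric ultimately reports (class
name is illustrative; the MXBean API is standard java.lang.management):

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

public class GcTimeProbe {

    // Sum the accumulated collection time (in ms) reported by each
    // collector's GarbageCollectorMXBean -- the value the hadoop
    // gcTimeMillis metric passes through. As discussed above, whether
    // this counts wall clock or CPU time is left to the collector.
    static long totalGcTimeMillis() {
        long total = 0;
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            long t = gc.getCollectionTime(); // -1 if unsupported
            if (t > 0) {
                total += t;
            }
        }
        return total;
    }

    public static void main(String[] args) {
        System.out.println("accumulated GC time: " + totalGcTimeMillis() + " ms");
    }
}
```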

Discussion Overview

group: cm-users
category: hadoop
posted: Aug 6, '13 at 11:33a
active: Aug 8, '13 at 5:53p
posts: 5
users: 2 (Jan Van Besien: 3 posts, Chris Leroy: 2 posts)
website: cloudera.com
irc: #hadoop
