We regularly get warnings like these on our Cloudera Manager cluster
(HBase):
The health test result for REGION_SERVER_GC_DURATION has become bad:
Average time spent in garbage collection was 41,387 ms per minute over
the previous 5 minute(s). Critical threshold: 60 ms.
That in itself is not difficult to understand, although I think 60 ms
per minute of garbage collection should not trigger any alarms:
that is only 0.1 percent of the time!
However, once I look at the configuration of these thresholds, I am
totally confused. The "HBase Master Garbage Collection Duration
Thresholds" configuration is:
warning: 30%
critical: 60%
With the explanation that this is "a percentage of elapsed wall clock time".
I agree, spending 30% to 60% of the time in a minute in garbage
collection, that would indeed be concerning. But that doesn't seem to
be what is triggering the warnings above.
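
To make the units concrete, here is a quick sketch of the arithmetic in Python. The function and constant names are my own and are not anything from Cloudera Manager; the only assumption is that "ms per minute" means milliseconds out of a 60,000 ms minute.

```python
# Hypothetical helper, not a Cloudera Manager API: converts the alert's
# "ms per minute" figure into a percentage of elapsed wall clock time.
MS_PER_MINUTE = 60_000


def gc_percent_of_wall_clock(gc_ms_per_minute: float) -> float:
    """Return GC time as a percentage of one minute of wall clock time."""
    return gc_ms_per_minute / MS_PER_MINUTE * 100.0


# Value reported in the alert: 41,387 ms per minute.
reported = gc_percent_of_wall_clock(41_387)

# The critical threshold from the alert, read literally as 60 ms per minute.
threshold_as_ms = gc_percent_of_wall_clock(60)

print(f"reported GC time:        {reported:.1f}% of wall clock time")
print(f"'60 ms' read literally:  {threshold_as_ms:.1f}% of wall clock time")
```

If that arithmetic is right, the reported 41,387 ms per minute is roughly 69% of wall clock time, which would exceed a 60% critical threshold. That makes me suspect the "60 ms" in the message is the configured 60% shown with the wrong unit, but I cannot confirm that.
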
Am I misunderstanding something in this configuration? Or is the
percentage being misinterpreted as ms by Cloudera Manager?
Thanks in advance,
Jan