We are having problems with one node (out of 56 in total) misbehaving:
* High number of full CMS old space collections during early morning
when we're doing bulkloads. Yes, bulkloads, not CQL, and only a few
* Really long stop-the-world GC events (I've seen up to 50 seconds) for
both CMS and ParNew.
* CPU usage higher during early morning hours compared to other nodes.
* The large number of Garbage Collections *seems* to correspond to doing
a lot of compactions (SizeTiered for most of our CFs, Leveled for a few
* Node losing track of which other nodes are up and keeping that state
until restart (this I think is a bug caused by the GC behaviour, with
the stop-the-world pauses keeping the node from accepting gossip connections from
This is on 2.0.13 with vnodes (256 per node).
All other nodes behave normally, with a few (2-3) full CMS old space
collections in the same 3h period in which the troubled node does some
30. Heap space is 8G, with NEW_SIZE set to 800M. With 6G/800M the
problem was even worse (it seems, this is a bit hard to debug as it
happens *almost* every night).
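
For comparing the troubled node against a healthy one, the GC lines in
system.log can be tallied with something like the following. This is a
minimal sketch, assuming the GCInspector lines look like
"GC for ConcurrentMarkSweep: 1234 ms for 1 collections, ..." (the
pattern, the summarize_gc helper, the 1000 ms threshold and the sample
lines are all made up for illustration, so adjust them to what your
logs actually contain):

```python
import re
from collections import Counter

# Assumed line shape of GCInspector output in system.log; adjust the
# pattern if your log lines differ.
GC_LINE = re.compile(
    r"GC for (?P<collector>\w+): (?P<ms>\d+) ms for (?P<count>\d+) collections"
)


def summarize_gc(lines, long_pause_ms=1000):
    """Return (collections per collector, [(collector, ms)] for long pauses)."""
    totals = Counter()
    long_pauses = []
    for line in lines:
        m = GC_LINE.search(line)
        if m is None:
            continue  # not a GC line
        collector = m.group("collector")
        ms = int(m.group("ms"))
        totals[collector] += int(m.group("count"))
        if ms >= long_pause_ms:
            long_pauses.append((collector, ms))
    return totals, long_pauses


# Hypothetical sample lines, for illustration only (not from our logs).
sample = [
    "INFO [ScheduledTasks:1] GCInspector.java (line 116) "
    "GC for ConcurrentMarkSweep: 50123 ms for 2 collections, "
    "7012345678 used; max is 8506048512",
    "INFO [ScheduledTasks:1] GCInspector.java (line 116) "
    "GC for ParNew: 312 ms for 1 collections, "
    "3012345678 used; max is 8506048512",
    "INFO [CompactionExecutor:7] CompactionTask.java (line 120) Compacting",
]

totals, long_pauses = summarize_gc(sample)
print(dict(totals))
print(long_pauses)
```

Running the same function over the 3h window on both nodes (e.g. by
passing an open file handle for each system.log) should give directly
comparable per-collector counts and a list of the long pauses.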
nodetool status shows that although we have a certain imbalance in the
cluster, this node is neither the most nor the least loaded. I.e. we
have between 1.6% and 2.1% in the "Owns" column, and the troublesome
node reports 1.7%.
All nodes are under Puppet control, so configuration is the same.
We're running NetworkTopologyStrategy with rack awareness, and here's a
deviation from recommended settings - we have a slightly varying number of
nodes in the racks:
The affected node is in the cssa04 rack. Could this mean I have some
kind of hotspot situation? Why would that show up as more GC work?
I'm quite puzzled here, so I'm looking for hints on how to identify what
is causing this.