We're having problems with one node (out of 56 in total) misbehaving.
Symptoms are:

* High number of full CMS old space collections during early morning
when we're doing bulkloads. Yes, bulkloads, not CQL, and only a few
thrift insertions.
* Really long stop-the-world GC events (I've seen up to 50 seconds) for
both CMS and ParNew.
* CPU usage higher during early morning hours compared to other nodes.
* The large number of Garbage Collections *seems* to correspond to doing
a lot of compactions (SizeTiered for most of our CFs, Leveled for a few
small ones)
* Node losing track of which other nodes are up and keeping that state
until restart (I think this is a bug triggered by the GC behaviour, with
the stop-the-world pauses keeping the node from accepting gossip
connections from other nodes)
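
To put numbers on the GC symptoms above, here is a minimal sketch that
counts collections and pause time per hour and per collector from
system.log. It assumes the GCInspector lines look like the example in the
comment, which is how I remember them from 2.0.x; the regex needs adjusting
if your log differs. The function name summarize_gc is made up for
illustration.

```python
import re
from collections import defaultdict

# Assumed GCInspector line shape (verify against your own system.log):
# INFO [ScheduledTasks:1] 2015-04-15 03:12:45,123 GCInspector.java (line 116)
#   GC for ConcurrentMarkSweep: 50123 ms for 1 collections, 6442450944 used; max is 8589934592
GC_LINE = re.compile(
    r"(?P<date>\d{4}-\d{2}-\d{2}) (?P<hour>\d{2}):\d{2}:\d{2},\d{3} "
    r"GCInspector\.java.*?GC for (?P<collector>\w+): (?P<ms>\d+) ms "
    r"for (?P<count>\d+) collections"
)


def summarize_gc(lines):
    """Return {(date, hour, collector): (collections, total_ms, max_ms)}.

    max_ms is the largest value reported on a single log line, which may
    cover more than one collection.
    """
    stats = defaultdict(lambda: [0, 0, 0])
    for line in lines:
        m = GC_LINE.search(line)
        if m is None:
            continue  # not a GCInspector line in the assumed format
        key = (m.group("date"), int(m.group("hour")), m.group("collector"))
        ms = int(m.group("ms"))
        entry = stats[key]
        entry[0] += int(m.group("count"))
        entry[1] += ms
        entry[2] = max(entry[2], ms)
    return {key: tuple(value) for key, value in stats.items()}
```

Feeding it open("system.log") from the troubled node and from a healthy
one, then comparing the ConcurrentMarkSweep rows for the early morning
hours, should show whether the difference really is in the order of 30
versus 2-3 collections. The same hourly buckets can be lined up against
the compaction entries in the log to check the suspected correlation.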

This is on 2.0.13 with vnodes (256 per node).

All other nodes behave normally, with a few (2-3) full CMS old space
collections in the same 3h period in which the troubled node does some
30. Heap space is 8G, with NEW_SIZE set to 800M. With 6G/800M the
problem seemed even worse (this is a bit hard to debug as it happens
*almost* every night).

nodetool status shows that although we have a certain imbalance in the
cluster, this node is neither the most nor the least loaded. I.e. we
have between 1.6% and 2.1% in the "Owns" column, and the troublesome
node reports 1.7%.

All nodes are under puppet control, so configuration is the same.

We're running NetworkTopologyStrategy with rack awareness, and here's a
deviation from recommended settings - we have a slightly varying number
of nodes in the racks:

      15 cssa01
      15 cssa02
      13 cssa03
      13 cssa04

The affected node is in the cssa04 rack. Could this mean I have some
kind of hotspot situation? Why would that show up as more GC work?
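
As a back-of-the-envelope check on the hotspot idea, here is a small
sketch of how much extra data a node in a 13-node rack might carry. It
rests on a simplifying assumption that is mine and not verified for this
cluster: that rack-aware placement ends up giving every rack the same
total amount of replica data, so a node's share is inversely proportional
to the size of its rack. The real distribution depends on the replication
factor and on the token layout.

```python
# Rack sizes as listed above.
rack_sizes = {"cssa01": 15, "cssa02": 15, "cssa03": 13, "cssa04": 13}


def relative_per_node_load(sizes):
    """Per-node load relative to a perfectly even cluster (1.0 = even).

    Assumption: each rack holds an equal share of all replica data, and
    nodes within a rack split that share evenly.
    """
    n_racks = len(sizes)
    total_nodes = sum(sizes.values())
    even_share = 1 / total_nodes
    return {
        rack: (1 / (n_racks * size)) / even_share
        for rack, size in sizes.items()
    }


loads = relative_per_node_load(rack_sizes)
for rack, factor in sorted(loads.items()):
    print(f"{rack}: {factor:.3f}x the even per-node share")
```

Under that assumption a node in cssa03 or cssa04 would hold 56/52, or
about 1.08 times the even share, against 56/60, or about 0.93, in the
15-node racks, a ratio of 15/13 (roughly 15% more). That would apply to
all 13 nodes in the rack though, so on its own it would not single out
one node.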

I'm quite puzzled here, so I'm looking for hints on how to identify what
is causing this.


Posted Apr 15, '15 at 12:16p; active Apr 15, '15 at 12:46p.
2 users in discussion: Michal Michalski (1 post), Erik Forsberg (1 post).


