I have a problem where 3 of my 84 nodes suffer from overly long GC
pauses, which leads to them being marked as DN.
This happens when I load data into them using CQL from a Hadoop job, so
there are quite a lot of inserts at a time. The CQL loading job uses
TokenAwarePolicy with fallback to DCAwareRoundRobinPolicy. Cassandra
java driver version 188.8.131.52 is in use.
My other observation is that around the time the GC starts to work like
crazy, there is a lot of outbound network traffic from the troublesome
nodes. Where a healthy node has around 25 Mbit/s in and 25 Mbit/s out,
an unhealthy one sees 25 Mbit/s in and 200 Mbit/s out.
So, something is iffy with these 3 nodes, but I have some trouble
finding out exactly what makes them differ.
This is Cassandra 2.0.13 (yes, old) using vnodes. The keyspace uses
NetworkTopologyStrategy with a replication factor of 2, in one
datacenter.
One thing I know I'm doing wrong is that I have a slightly differing
number of hosts in each of my 6 chassis (one of them has 15 nodes, one
has 13, the remaining four have 14). Could what I'm seeing here be the
effect of that?
Other ideas on what could be wrong? Some kind of vnode imbalance? How
can I diagnose that? What metrics should I be looking at?
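
One rough way to gauge the chassis question is to simulate it. The
sketch below is a simplification, not Cassandra's actual placement
code, and everything in it beyond the node counts and the replication
factor is an assumption: 256 random tokens per node (the stock
num_tokens value, not confirmed for this cluster), each chassis
configured as its own rack in the snitch, and a rack-aware walk that
picks the next node on a rack not yet used. All function names are made
up for illustration.

```python
import random
from collections import defaultdict

TOKEN_MIN = -2**63       # Murmur3Partitioner token range
TOKEN_MAX = 2**63 - 1
RING_SIZE = 2**64


def build_ring(rack_sizes, vnodes_per_node, rng):
    """Return (sorted list of (token, node_id), node_id -> rack index)."""
    ring, rack_of, node_id = [], {}, 0
    for rack, size in enumerate(rack_sizes):
        for _ in range(size):
            rack_of[node_id] = rack
            for _ in range(vnodes_per_node):
                ring.append((rng.randint(TOKEN_MIN, TOKEN_MAX), node_id))
            node_id += 1
    ring.sort()
    return ring, rack_of


def replicas_for(ring, rack_of, index, rf):
    """Walk clockwise from ring[index], taking one node per unseen rack.

    Simplified: assumes rf does not exceed the number of racks.
    """
    chosen, racks_seen = [], set()
    n = len(ring)
    for step in range(n):
        node = ring[(index + step) % n][1]
        if rack_of[node] not in racks_seen:
            chosen.append(node)
            racks_seen.add(rack_of[node])
            if len(chosen) == rf:
                break
    return chosen


def replicated_ownership(ring, rack_of, rf):
    """Fraction of the token space each node stores, replicas included."""
    owned = defaultdict(float)
    for i in range(len(ring)):
        # Range (previous token, this token]; index -1 wraps the ring.
        width = (ring[i][0] - ring[i - 1][0]) % RING_SIZE
        for node in replicas_for(ring, rack_of, i, rf):
            owned[node] += width / RING_SIZE
    return owned


def per_rack_average(owned, rack_of, rack_sizes):
    """Average per-node ownership within each rack."""
    totals = [0.0] * len(rack_sizes)
    for node, share in owned.items():
        totals[rack_of[node]] += share
    return [total / size for total, size in zip(totals, rack_sizes)]


if __name__ == "__main__":
    rack_sizes = [15, 13, 14, 14, 14, 14]  # node counts from the post
    ring, rack_of = build_ring(rack_sizes, 256, random.Random(42))
    owned = replicated_ownership(ring, rack_of, rf=2)
    for rack, avg in enumerate(per_rack_average(owned, rack_of, rack_sizes)):
        print(f"rack {rack}: {rack_sizes[rack]} nodes, "
              f"avg ownership per node {avg:.4%}")
    print(f"min node {min(owned.values()):.4%}, "
          f"max node {max(owned.values()):.4%}")
```

The numbers it prints are only an estimate under those assumptions. The
real per-node figures would come from running nodetool status with the
keyspace name on the cluster itself and comparing the effective
ownership of the 3 troublesome nodes against the rest.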