Hi!

I have a problem where 3 of my 84 nodes misbehave with excessively long
GC times, leading to them being marked as DN.

This happens when I load data into them using CQL from a Hadoop job, so
quite a lot of inserts at a time. The CQL loading job uses
TokenAwarePolicy with a fallback to DCAwareRoundRobinPolicy, on version
2.1.7.1 of the Cassandra Java driver.

My other observation is that around the time GC starts to work like
crazy, there is a lot of outbound network traffic from the troublesome
nodes. Where a healthy node sees around 25 Mbit/s in and 25 Mbit/s out,
an unhealthy one sees 25 Mbit/s in and 200 Mbit/s out.

So, something is iffy with these 3 nodes, but I'm having trouble
pinning down exactly what makes them differ.

This is Cassandra 2.0.13 (yes, old) using vnodes. The keyspace uses
NetworkTopologyStrategy with replication factor 2, in one datacenter.

One thing I know I'm doing wrong is that I have a slightly differing
number of hosts in each of my 6 chassis (one of them has 15 nodes, one
has 13, and the remaining four have 14 each). Could what I'm seeing
here be an effect of that?

Other ideas on what could be wrong? Some kind of vnode imbalance? How
can I diagnose that? What metrics should I be looking at?
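
One way I can think of to check ownership balance is to compare the
"Owns (effective)" column of "nodetool status <keyspace>" across all 84
nodes. Below is a rough sketch of such a check; the keyspace name is a
placeholder and the parsing assumes the 2.0-era status output, so treat
it as a starting point rather than a finished tool:

    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import java.util.ArrayList;
    import java.util.List;

    public class OwnershipCheck {
        public static void main(String[] args) throws Exception {
            // Run "nodetool status <keyspace>" and summarise the
            // "Owns (effective)" column. "my_keyspace" is a placeholder;
            // nodetool must be on the PATH.
            String keyspace = args.length > 0 ? args[0] : "my_keyspace";
            Process p = new ProcessBuilder("nodetool", "status", keyspace)
                    .redirectErrorStream(true).start();

            List<Double> owns = new ArrayList<Double>();
            BufferedReader r = new BufferedReader(
                    new InputStreamReader(p.getInputStream()));
            String line;
            while ((line = r.readLine()) != null) {
                // Node rows start with a status code such as "UN" or "DN";
                // the ownership field is the one ending in "%".
                if (!line.startsWith("UN") && !line.startsWith("DN")) continue;
                for (String field : line.trim().split("\\s+")) {
                    if (field.endsWith("%")) {
                        owns.add(Double.valueOf(field.substring(0, field.length() - 1)));
                    }
                }
            }
            r.close();

            if (owns.isEmpty()) {
                System.out.println("No ownership data parsed; check the keyspace name.");
                return;
            }
            double min = Double.MAX_VALUE, max = -Double.MAX_VALUE, sum = 0;
            for (double o : owns) {
                min = Math.min(min, o);
                max = Math.max(max, o);
                sum += o;
            }
            System.out.printf(
                    "nodes=%d min=%.2f%% max=%.2f%% mean=%.2f%% spread=%.2f%%%n",
                    owns.size(), min, max, sum / owns.size(), max - min);
        }
    }

A large spread between min and max ownership would suggest a token or
replica imbalance worth digging into.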

Thanks,
\EF

  • Sai krishnam raju potturi at Apr 19, 2016 at 1:55 pm
    hi;
    do you see any hung processes, like repairs, running on those 3
    nodes? What does "nodetool netstats" show?

    thanks
    Sai
    On Tue, Apr 19, 2016 at 8:24 AM, Erik Forsberg wrote:

  • Patrick McFadin at Apr 20, 2016 at 6:47 pm
    Can you show the output of tpstats on one of the affected nodes? That
    will give some indication of where the trouble might be.

    Patrick
    On Tue, Apr 19, 2016 at 6:54 AM, sai krishnam raju potturi wrote:

  • Erik Forsberg at Apr 21, 2016 at 12:21 pm

    On 2016-04-19 15:54, sai krishnam raju potturi wrote:
    hi;
    do you see any hung processes, like repairs, running on those 3
    nodes? What does "nodetool netstats" show?

    No hung processes from what I can see.

    root@cssa02-06:~# nodetool tpstats
    Pool Name                    Active   Pending      Completed   Blocked  All time blocked
    ReadStage                         0         0        1530227         0                 0
    RequestResponseStage              0         0       19230947         0                 0
    MutationStage                     0         0       37059234         0                 0
    ReadRepairStage                   0         0          80178         0                 0
    ReplicateOnWriteStage             0         0              0         0                 0
    GossipStage                       0         0          43003         0                 0
    CacheCleanupExecutor              0         0              0         0                 0
    MigrationStage                    0         0              0         0                 0
    MemoryMeter                       0         0            267         0                 0
    FlushWriter                       0         0            202         0                 5
    ValidationExecutor                0         0            212         0                 0
    InternalResponseStage             0         0              0         0                 0
    AntiEntropyStage                  0         0            427         0                 0
    MemtablePostFlusher               0         0            669         0                 0
    MiscStage                         0         0            212         0                 0
    PendingRangeCalculator            0         0             70         0                 0
    CompactionExecutor                0         0           1206         0                 0
    commitlog_archiver                0         0              0         0                 0
    HintedHandoff                     0         1            113         0                 0

    Message type           Dropped
    RANGE_SLICE                  1
    READ_REPAIR                  0
    PAGED_RANGE                  0
    BINARY                       0
    READ                       219
    MUTATION                     3
    _TRACE                       0
    REQUEST_RESPONSE             2
    COUNTER_MUTATION             0

    root@cssa02-06:~# nodetool netstats
    Mode: NORMAL
    Not sending any streams.
    Read Repair Statistics:
    Attempted: 75317
    Mismatch (Blocking): 0
    Mismatch (Background): 11
    Pool Name                    Active   Pending      Completed
    Commands                        n/a         1       19248846
    Responses                       n/a         0       19875699

    \EF
