To elaborate a bit on what Marcin said:

* Once a node starts to believe that a few other nodes are down, it seems
to stay that way for a very long time (hours). I'm not even sure it will
recover without a restart.
* I've tried stopping and then starting gossip with nodetool on the node that
thinks several other nodes are down (see the sketch after this list). It did not help.
* nodetool gossipinfo, when run on an affected node, claims STATUS:NORMAL for
all nodes (including the ones marked as down in the status output)
* It is quite possible that the problem starts at the time of day when we
have a lot of bulkloading going on. But why does it then stay for several
hours after the load goes down?
* I have the feeling this started with our upgrade from 1.2.18 to 2.0.12
about a month ago, but I have no hard data to back that up.
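
A minimal sketch of what those two checks looked like on the affected node
(from memory, not a verbatim transcript):

$ nodetool disablegossip                      # stop gossiping with the rest of the cluster
$ nodetool enablegossip                       # start gossip again
$ nodetool status | grep DN                   # the same nodes are still listed as DN
$ nodetool gossipinfo | grep -E '^/|STATUS'   # yet every endpoint shows STATUS:NORMAL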

Regarding region/snitch - this is not an AWS deployment, we run in our own
datacenter with GossipingPropertyFileSnitch.

Right now I have this situation with one node (04-05) thinking that there
are 4 nodes down. The rest of the cluster (56 nodes in total) thinks all
nodes are up. Load on the cluster right now is minimal and there is no GC
going on. Heap usage is approximately 3.5 GB out of 6 GB.

root@cssa04-05:~# nodetool status|grep DN
DN  2001:4c28:1:413:0:1:2:5   1.07 TB    256  1.8%  114ff46e-57d0-40dd-87fb-3e4259e96c16  rack2
DN  2001:4c28:1:413:0:1:2:6   1.06 TB    256  1.8%  b161a6f3-b940-4bba-9aa3-cfb0fc1fe759  rack2
DN  2001:4c28:1:413:0:1:2:13  896.82 GB  256  1.6%  4a488366-0db9-4887-b538-4c5048a6d756  rack2
DN  2001:4c28:1:413:0:1:3:7   1.04 TB    256  1.8%  95cf2cdb-d364-4b30-9b91-df4c37f3d670  rack3
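
For completeness, a sketch of how the heap and GC figures mentioned above can
be double-checked (the log path assumes a default package install):

$ nodetool info | grep -i heap                           # heap used / total in MB
$ grep GCInspector /var/log/cassandra/system.log | tail  # recent long GC pauses, if any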

Excerpt from nodetool gossipinfo showing one node that status thinks is
down (2:5) and one that status thinks is up (3:12):

/2001:4c28:1:413:0:1:2:5
   generation:1427712750
   heartbeat:2310212
   NET_VERSION:7
   RPC_ADDRESS:0.0.0.0
   RELEASE_VERSION:2.0.13
   RACK:rack2
   LOAD:1.172524771195E12
   INTERNAL_IP:2001:4c28:1:413:0:1:2:5
   HOST_ID:114ff46e-57d0-40dd-87fb-3e4259e96c16
   DC:iceland
   SEVERITY:0.0
   STATUS:NORMAL,100493381707736523347375230104768602825
   SCHEMA:4b994277-19a5-3458-b157-f69ef9ad3cda
/2001:4c28:1:413:0:1:3:12
   generation:1427714889
   heartbeat:2305710
   NET_VERSION:7
   RPC_ADDRESS:0.0.0.0
   RELEASE_VERSION:2.0.13
   RACK:rack3
   LOAD:1.047542503234E12
   INTERNAL_IP:2001:4c28:1:413:0:1:3:12
   HOST_ID:bb20ddcb-0a14-4d91-b90d-fb27536d6b00
   DC:iceland
   SEVERITY:0.0
   STATUS:NORMAL,100163259989151698942931348962560111256
   SCHEMA:4b994277-19a5-3458-b157-f69ef9ad3cda

I also tried disablegossip + enablegossip on 02-05 to see if that made
04-05 mark it as up, with no success.
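
Roughly what that attempt looked like (the pause length is arbitrary; not a
verbatim transcript):

# On 02-05 (one of the nodes that 04-05 marks as DN):
$ nodetool disablegossip && sleep 30 && nodetool enablegossip
# Back on 04-05, the node is still listed as DN:
$ nodetool status | grep DN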

Please let me know what other debug information I can provide.

Regards,
\EF
On Thu, Apr 2, 2015 at 6:56 PM, daemeon reiydelle wrote:

Do you happen to be using a tool like Nagios or Ganglia that is able to
report utilization (CPU, load, disk I/O, network)? There are plugins for
both that will also notify you (depending on whether you have enabled the
intermediate GC logging) about what is happening.
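
If neither is set up, a few one-off checks on the affected node can give a
rough picture in the meantime (assuming the sysstat tools are installed):

$ uptime                  # load averages
$ iostat -x 5 3           # per-device disk utilization
$ sar -n DEV 5 3          # network throughput per interface
$ nodetool tpstats        # Cassandra thread pools: pending / blocked tasks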


On Thu, Apr 2, 2015 at 8:35 AM, Jan wrote:

Marcin ;

Are all your nodes within the same region?
If not, what snitch type are you using?

Jan/



On Thursday, April 2, 2015 3:28 AM, Michal Michalski <michal.michalski@boxever.com> wrote:


Hey Marcin,

Are they actually going up and down repeatedly (flapping), or do they go down
and never come back?
There might be different reasons for flapping nodes, but to list what I have
off the top of my head right now:

1. Network issues. I don't think it's your case, but you can read about
the issues some people are having when deploying C* on AWS EC2 (keyword to
look for: phi_convict_threshold)

2. Heavy load. The node is under heavy load because of a massive number of
reads / writes / bulkloads, or e.g. unthrottled compaction, which may
result in excessive GC.

Could any of these be a problem in your case? I'd start by investigating the
GC logs, e.g. to see how long the "stop the world" full GC takes (GC logs
should be on by default from what I can see [1]). A rough checklist for both
points is sketched below.

[1] https://issues.apache.org/jira/browse/CASSANDRA-5319
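
A rough sketch of those checks (paths assume a default package install;
phi_convict_threshold defaults to 8 and may be commented out in cassandra.yaml):

# 1. Failure detector sensitivity (raising it makes nodes less likely to be marked down):
$ grep phi_convict_threshold /etc/cassandra/cassandra.yaml
# 2. Long stop-the-world pauses reported by Cassandra itself:
$ grep GCInspector /var/log/cassandra/system.log | tail
# 3. Compaction backlog and pending/blocked tasks under heavy load:
$ nodetool compactionstats
$ nodetool tpstats
# Optionally throttle compaction while investigating (value is MB/s):
$ nodetool setcompactionthroughput 16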

Michał


Kind regards,
Michał Michalski,
michal.michalski@boxever.com

On 2 April 2015 at 11:05, Marcin Pietraszek <mpietraszek@opera.com> wrote:

Hi!

We have a 56-node cluster with C* 2.0.13 + the CASSANDRA-9036 patch
installed. Assume we have nodes A, B, C, D, E. On an irregular basis,
one of those nodes starts to report that a subset of the other nodes is
in the DN state, although the C* daemon is running on all nodes:

A$ nodetool status
UN B
DN C
DN D
UN E

B$ nodetool status
UN A
UN C
UN D
UN E

C$ nodetool status
DN A
UN B
UN D
UN E

After a restart of node A, C and D report that A is in UN, and A also
claims that the whole cluster is in the UN state. Right now I don't have
any clear steps to reproduce the situation. Do you guys have any idea
what could be causing such behaviour? How could it be prevented?

It seems that when node A is the coordinator and gets a request for data
replicated on C and D, it responds with an UnavailableException; after
restarting A, that problem disappears.
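
To see at a glance which node holds the stale view, a sweep along these
lines could be run (host names are placeholders; passwordless ssh is assumed):

$ for h in nodeA nodeB nodeC nodeD nodeE; do
    echo -n "$h: "; ssh "$h" "nodetool status | grep -c '^DN'"
  done
# Any node printing a non-zero count currently sees part of the cluster as down.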

--
mp


