I have a 3-node cluster. Nodes 2 and 3 went down due to OOM, but node 1 survived. Clients could publish new messages but none were delivered. Node 1 had plenty of memory left, so no memory-based blocking was (or at least should have been) in effect.


I then tried to bring nodes 2 and 3 back online by simply restarting them. This is what happened:


Node 1 floods its log for a while at a rate of 20-100 entries/sec:
=ERROR REPORT==== 18-Mar-2013::07:10:40 ===
Discarding message {'$gen_call',{<0.17965.1>,#Ref<0.0.1.90282>},stat} from <0.17965.1> to <0.5037.1> in an old incarnation (1) of this node (2)
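
For anyone wanting to check a flood rate like this against their own log, here is a minimal Python sketch that counts "=ERROR REPORT=" headers per second. It is not part of RabbitMQ; the function name and the sample lines are made up for illustration, and only the header format is taken from the excerpt above.

```python
import re
from collections import Counter

# Matches the report header format seen in the log excerpt above.
HEADER = re.compile(
    r"^=ERROR REPORT==== (\d{1,2}-[A-Za-z]{3}-\d{4}::\d{2}:\d{2}:\d{2}) ==="
)

def error_reports_per_second(lines):
    """Count =ERROR REPORT= headers per one-second timestamp."""
    counts = Counter()
    for line in lines:
        match = HEADER.match(line)
        if match:
            counts[match.group(1)] += 1
    return counts

# Illustrative input only: the header format is copied from the excerpt,
# but the repetition pattern and the second timestamp are made up.
sample = [
    "=ERROR REPORT==== 18-Mar-2013::07:10:40 ===",
    "Discarding message (body omitted in this sample)",
    "=ERROR REPORT==== 18-Mar-2013::07:10:40 ===",
    "Discarding message (body omitted in this sample)",
    "=ERROR REPORT==== 18-Mar-2013::07:10:41 ===",
    "Discarding message (body omitted in this sample)",
]

for second, count in sorted(error_reports_per_second(sample).items()):
    print(second, count)
```

On a real log, pass the open file object instead of the sample list; the function only needs an iterable of lines.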


I start up node 3. Its log floods with:
=ERROR REPORT==== 18-Mar-2013::08:23:15 ===
Discarding message {'$gen_call',{<0.7609.0>,#Ref<0.0.1.142489>},stat} from <0.7609.0> to <0.25515.26> in an old incarnation (1) of this node (3)
and is stuck at
"starting exchange, queue and binding recovery ..."


rabbitmqctl status hangs forever on node 1.


I start up node 2. It starts fast and says "Broker started" in startup_log, but doesn't list the plugins. "service rabbitmq-server start" never returns, and neither does rabbitmqctl status.


Node 2 then runs out of memory again, this time without any client connections:
=INFO REPORT==== 18-Mar-2013::09:09:35 ===
vm_memory_high_watermark set. Memory used:7336394640 allowed:7031336140
=WARNING REPORT==== 18-Mar-2013::09:09:35 ===
memory resource limit alarm set on node rabbit at tiger02
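
To put a number on how far past the limit node 2 was, a small Python sketch that pulls the two figures out of the watermark line and compares them. The function name is hypothetical; only the log line itself comes from the excerpt above.

```python
import re

# Extracts the two byte counts from a vm_memory_high_watermark log line.
WATERMARK = re.compile(r"Memory used:(\d+) allowed:(\d+)")

def watermark_overshoot(line):
    """Return (bytes over the limit, used/allowed ratio), or None if no match."""
    match = WATERMARK.search(line)
    if match is None:
        return None
    used, allowed = int(match.group(1)), int(match.group(2))
    return used - allowed, used / allowed

line = "vm_memory_high_watermark set. Memory used:7336394640 allowed:7031336140"
excess_bytes, ratio = watermark_overshoot(line)
print(excess_bytes)  # 305058500
print(ratio)
```

So the node was roughly 305 MB over its allowed memory when the alarm was set, even though no clients were connected.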


Querying /api/overview on node 1 gives:
{error,{error,{badmatch,false},
[{rabbit_mgmt_wm_overview,version,1},
{rabbit_mgmt_wm_overview,to_json,2},
{webmachine_resource,resource_call,3},
{webmachine_resource,do,3},
{webmachine_decision_core,resource_call,1},
{webmachine_decision_core,decision,1},
{webmachine_decision_core,handle_request,2},
{rabbit_webmachine,'-makeloop/1-fun-0-',2}]}}


Node 3 eventually starts.
I kill node 2 and start it again; it stops at "starting database ...".
Nothing in the log or startup_err; CPU usage is 0%.
I kill it after 30 min and start it again; same thing.


Node 3 can now output rabbitmqctl status; node 1 still cannot.
Node 1 can't be shut down, so I force-kill it.
With node 1 down, node 2 now gets past "starting database" and starts.
Neither node 2 nor node 3 responds to rabbitmqctl status.
I try shutting down node 2, but it doesn't respond; I have to use kill -9.
Node 3 still doesn't respond to rabbitmqctl status.
I shut down node 3; it doesn't respond, so I kill it instead. Now all nodes are down.


Note: when rabbitmqctl status doesn't work, other commands such as list_users and cluster_status still work.


Starting up node 3, the log now gets flooded with:
=ERROR REPORT==== 18-Mar-2013::11:09:04 ===
** Generic server <0.629.0> terminating
** Last message in was {init,<0.182.0>}
** When Server state == {q,{amqqueue,
{resource,<<"vhost1">>,queue,
<<"tmp_topic-0.9934227485209703">>},
true,true,<0.21310.24>,[],<0.629.0>,[],[],
[{vhost,<<"vhost1">>},
{name,<<"HA">>},
{pattern,<<".*">>},
{definition,[{<<"ha-mode">>,<<"all">>}]},
{priority,0}],
[{<6868.7071.0>,<6868.7070.0>},
{<6867.19845.80>,<6867.19844.80>},
{<0.21601.24>,<0.21548.24>}]},
none,false,undefined,undefined,
{[],[]},
undefined,undefined,undefined,undefined,
{state,fine,5000,undefined},
{0,nil},
undefined,undefined,undefined,
{dict,0,16,16,8,80,48,
{[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],
[]},
{{[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],
[]}}},
1,
{{0,nil},{0,nil}},
undefined,
{dict,0,16,16,8,80,48,
{[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],
[]},
{{[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],
[]}}},
undefined,undefined}
** Reason for termination ==
** {'module could not be loaded',
[{undefined,init,
[{amqqueue,
{resource,<<"vhost1">>,queue,
<<"tmp_topic-0.9934227485209703">>},
true,true,<0.21310.24>,[],<0.629.0>,[],[],
[{vhost,<<"vhost1">>},
{name,<<"HA">>},
{pattern,<<".*">>},
{definition,[{<<"ha-mode">>,<<"all">>}]},
{priority,0}],
[{<6868.7071.0>,<6868.7070.0>},
{<6867.19845.80>,<6867.19844.80>},
{<0.21601.24>,<0.21548.24>}]},
true,#Fun<rabbit_amqqueue_process.5.64830354>]},
{rabbit_amqqueue_process,handle_call,3},
{gen_server2,handle_msg,2},
{proc_lib,wake_up,3}]}


but it comes online eventually and can answer "rabbitmqctl status".


I start up node 2, which also reports a lot of:
=ERROR REPORT==== 18-Mar-2013::11:11:06 ===
** Generic server <0.640.0> terminating
** Last message in was {init,<0.152.0>}
** When Server state == {q,{amqqueue,
{resource,<<"vhost1">>,queue,
<<"tmp_topic-0.1019297200255096">>},
true,true,<0.977.11>,[],<0.640.0>,[],[],
undefined,[]},
none,false,undefined,undefined,
{[],[]},
undefined,undefined,undefined,undefined,
{state,fine,5000,undefined},
{0,nil},
undefined,undefined,undefined,
{dict,0,16,16,8,80,48,
{[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],
[]},
{{[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],
[]}}},
1,
{{0,nil},{0,nil}},
undefined,
{dict,0,16,16,8,80,48,
{[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],
[]},
{{[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],
[]}}},
undefined,undefined}
** Reason for termination ==
** {'module could not be loaded',
[{undefined,init,
[{amqqueue,
{resource,<<"vhost1">>,queue,
<<"tmp_topic-0.1019297200255096">>},
true,true,<0.977.11>,[],<0.640.0>,[],[],undefined,[]},
true,#Fun<rabbit_amqqueue_process.5.64830354>]},
{rabbit_amqqueue_process,handle_call,3},
{gen_server2,handle_msg,2},
{proc_lib,wake_up,3}]}
=ERROR REPORT==== 18-Mar-2013::11:11:06 ===
** Generic server <0.645.0> terminating
** Last message in was {init,<0.152.0>}
** When Server state == {q,{amqqueue,
{resource,<<"vhost1">>,queue,
<<"tmp_topic-0.8794151877518743">>},
true,true,<0.30538.0>,[],<0.645.0>,[],[],
[{vhost,<<"vhost1">>},
{name,<<"HA">>},
{pattern,<<".*">>},
{definition,[{<<"ha-mode">>,<<"all">>}]},
{priority,0}],
[{<6872.28270.5>,<6872.28269.5>},
{<0.32304.1>,<0.30804.0>}]},
none,false,undefined,undefined,
{[],[]},
undefined,undefined,undefined,undefined,
{state,fine,5000,undefined},
{0,nil},
undefined,undefined,undefined,
{dict,0,16,16,8,80,48,
{[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],
[]},
{{[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],
[]}}},
1,
{{0,nil},{0,nil}},
undefined,
{dict,0,16,16,8,80,48,
{[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],
[]},
{{[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],
[]}}},
undefined,undefined}
** Reason for termination ==
** {'module could not be loaded',
[{undefined,init,
[{amqqueue,
{resource,<<"vhost1">>,queue,
<<"tmp_topic-0.8794151877518743">>},
true,true,<0.30538.0>,[],<0.645.0>,[],[],
[{vhost,<<"vhost1">>},
{name,<<"HA">>},
{pattern,<<".*">>},
{definition,[{<<"ha-mode">>,<<"all">>}]},
{priority,0}],
[{<6872.28270.5>,<6872.28269.5>},{<0.32304.1>,<0.30804.0>}]},
true,#Fun<rabbit_amqqueue_process.5.64830354>]},
{rabbit_amqqueue_process,handle_call,3},
{gen_server2,handle_msg,2},
{proc_lib,wake_up,3}]}


Node 2 comes online and I can now query rabbitmqctl status.
I start up node 1 and it comes online.
The cluster is now working again, but several durable queues are gone(!)

Posted to rabbitmq-discuss on Mar 19, '13 at 3:41a.
