Have a 3 node cluster; nodes 2 and 3 went down due to OOM, but node 1 survived. Clients could push new messages but none were delivered. Node 1 had plenty of memory left, so no memory-based blocking was (or at least shouldn't have been) in effect.


I then tried to bring nodes 2 and 3 back online by simply restarting them; this is what happened:


Node 1 floods the logs for a while at a rate of 20-100 messages/sec:
=ERROR REPORT==== 18-Mar-2013::07:10:40 ===
Discarding message {'$gen_call',{<0.17965.1>,#Ref<0.0.1.90282>},stat} from <0.17965.1> to <0.5037.1> in an old incarnation (1) of this node (2)


Start up node 3; its log floods with
=ERROR REPORT==== 18-Mar-2013::08:23:15 ===
Discarding message {'$gen_call',{<0.7609.0>,#Ref<0.0.1.142489>},stat} from <0.7609.0> to <0.25515.26> in an old incarnation (1) of this node (3)
and is stuck at
"starting exchange, queue and binding recovery ..."


rabbitmqctl status hangs forever on node 1


Start up node 2; it starts fast and says "Broker started" in startup_log, but doesn't list the plugins. "service rabbitmq-server start" never returns, and rabbitmqctl status never returns either


node 2 then runs out of memory again, without client connections this time:
=INFO REPORT==== 18-Mar-2013::09:09:35 ===
vm_memory_high_watermark set. Memory used:7336394640 allowed:7031336140
=WARNING REPORT==== 18-Mar-2013::09:09:35 ===
memory resource limit alarm set on node rabbit@tiger02


Querying /api/overview at node1 gives:
{error,{error,{badmatch,false},
[{rabbit_mgmt_wm_overview,version,1},
{rabbit_mgmt_wm_overview,to_json,2},
{webmachine_resource,resource_call,3},
{webmachine_resource,do,3},
{webmachine_decision_core,resource_call,1},
{webmachine_decision_core,decision,1},
{webmachine_decision_core,handle_request,2},
{rabbit_webmachine,'-makeloop/1-fun-0-',2}]}}


node 3 starts eventually.
I kill node 2 and start it again; it stops at "starting database ?"
nothing in the log or startup_err, CPU usage 0%
I kill it after 30 min and start again; same thing.


node 3 can now output rabbitmqctl status; node 1 still cannot.
node 1 can't be shut down; I force kill it.
with node 1 down, node 2 now gets past "starting database" and starts
neither node 2 nor node 3 responds to rabbitmqctl status
I shut down node 2, but it doesn't respond; have to do kill -9
node 3 still doesn't respond to rabbitmqctl status
I shut down node 3; it doesn't respond either, so I kill it too. Now all nodes are down.


note: when rabbitmqctl status doesn't work, other commands like list_users, cluster_status etc. still work.


Starting up node 3, the log now gets flooded with:
=ERROR REPORT==== 18-Mar-2013::11:09:04 ===
** Generic server <0.629.0> terminating
** Last message in was {init,<0.182.0>}
** When Server state == {q,{amqqueue,
{resource,<<"vhost1">>,queue,
<<"tmp_topic-0.9934227485209703">>},
true,true,<0.21310.24>,[],<0.629.0>,[],[],
[{vhost,<<"vhost1">>},
{name,<<"HA">>},
{pattern,<<".*">>},
{definition,[{<<"ha-mode">>,<<"all">>}]},
{priority,0}],
[{<6868.7071.0>,<6868.7070.0>},
{<6867.19845.80>,<6867.19844.80>},
{<0.21601.24>,<0.21548.24>}]},
none,false,undefined,undefined,
{[],[]},
undefined,undefined,undefined,undefined,
{state,fine,5000,undefined},
{0,nil},
undefined,undefined,undefined,
{dict,0,16,16,8,80,48,
{[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],
[]},
{{[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],
[]}}},
1,
{{0,nil},{0,nil}},
undefined,
{dict,0,16,16,8,80,48,
{[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],
[]},
{{[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],
[]}}},
undefined,undefined}
** Reason for termination ==
** {'module could not be loaded',
[{undefined,init,
[{amqqueue,
{resource,<<"vhost1">>,queue,
<<"tmp_topic-0.9934227485209703">>},
true,true,<0.21310.24>,[],<0.629.0>,[],[],
[{vhost,<<"vhost1">>},
{name,<<"HA">>},
{pattern,<<".*">>},
{definition,[{<<"ha-mode">>,<<"all">>}]},
{priority,0}],
[{<6868.7071.0>,<6868.7070.0>},
{<6867.19845.80>,<6867.19844.80>},
{<0.21601.24>,<0.21548.24>}]},
true,#Fun<rabbit_amqqueue_process.5.64830354>]},
{rabbit_amqqueue_process,handle_call,3},
{gen_server2,handle_msg,2},
{proc_lib,wake_up,3}]}


but it comes online eventually and responds to "rabbitmqctl status"


I start up node 2; it also reports a lot of:
=ERROR REPORT==== 18-Mar-2013::11:11:06 ===
** Generic server <0.640.0> terminating
** Last message in was {init,<0.152.0>}
** When Server state == {q,{amqqueue,
{resource,<<"vhost1">>,queue,
<<"tmp_topic-0.1019297200255096">>},
true,true,<0.977.11>,[],<0.640.0>,[],[],
undefined,[]},
none,false,undefined,undefined,
{[],[]},
undefined,undefined,undefined,undefined,
{state,fine,5000,undefined},
{0,nil},
undefined,undefined,undefined,
{dict,0,16,16,8,80,48,
{[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],
[]},
{{[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],
[]}}},
1,
{{0,nil},{0,nil}},
undefined,
{dict,0,16,16,8,80,48,
{[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],
[]},
{{[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],
[]}}},
undefined,undefined}
** Reason for termination ==
** {'module could not be loaded',
[{undefined,init,
[{amqqueue,
{resource,<<"vhost1">>,queue,
<<"tmp_topic-0.1019297200255096">>},
true,true,<0.977.11>,[],<0.640.0>,[],[],undefined,[]},
true,#Fun<rabbit_amqqueue_process.5.64830354>]},
{rabbit_amqqueue_process,handle_call,3},
{gen_server2,handle_msg,2},
{proc_lib,wake_up,3}]}
=ERROR REPORT==== 18-Mar-2013::11:11:06 ===
** Generic server <0.645.0> terminating
** Last message in was {init,<0.152.0>}
** When Server state == {q,{amqqueue,
{resource,<<"vhost1">>,queue,
<<"tmp_topic-0.8794151877518743">>},
true,true,<0.30538.0>,[],<0.645.0>,[],[],
[{vhost,<<"vhost1">>},
{name,<<"HA">>},
{pattern,<<".*">>},
{definition,[{<<"ha-mode">>,<<"all">>}]},
{priority,0}],
[{<6872.28270.5>,<6872.28269.5>},
{<0.32304.1>,<0.30804.0>}]},
none,false,undefined,undefined,
{[],[]},
undefined,undefined,undefined,undefined,
{state,fine,5000,undefined},
{0,nil},
undefined,undefined,undefined,
{dict,0,16,16,8,80,48,
{[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],
[]},
{{[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],
[]}}},
1,
{{0,nil},{0,nil}},
undefined,
{dict,0,16,16,8,80,48,
{[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],
[]},
{{[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],
[]}}},
undefined,undefined}
** Reason for termination ==
** {'module could not be loaded',
[{undefined,init,
[{amqqueue,
{resource,<<"vhost1">>,queue,
<<"tmp_topic-0.8794151877518743">>},
true,true,<0.30538.0>,[],<0.645.0>,[],[],
[{vhost,<<"vhost1">>},
{name,<<"HA">>},
{pattern,<<".*">>},
{definition,[{<<"ha-mode">>,<<"all">>}]},
{priority,0}],
[{<6872.28270.5>,<6872.28269.5>},{<0.32304.1>,<0.30804.0>}]},
true,#Fun<rabbit_amqqueue_process.5.64830354>]},
{rabbit_amqqueue_process,handle_call,3},
{gen_server2,handle_msg,2},
{proc_lib,wake_up,3}]}


node 2 comes online; I can now query rabbitmqctl status
starting up node 1; it comes online
the cluster is now working again, but several durable queues are gone(!)


  • Tim Watson at Mar 20, 2013 at 4:25 pm
    Hi Carl,


    What version of rabbit are you running? A number of bugs pertaining to the 'Discarding message ... in an old incarnation .. of this node' were fixed in recent(ish) releases.


    Cheers,
    Tim


    On 19 Mar 2013, at 03:41, Carl Hörberg wrote:

    [snip - original post quoted in full]
    _______________________________________________
    rabbitmq-discuss mailing list
    rabbitmq-discuss at lists.rabbitmq.com
    https://lists.rabbitmq.com/cgi-bin/mailman/listinfo/rabbitmq-discuss
  • Tim Watson at Mar 20, 2013 at 6:03 pm
    Hi Carl,

    On 20 Mar 2013, at 16:25, Tim Watson wrote:
    What version of rabbit are you running? A number of bugs pertaining to the 'Discarding message ... in an old incarnation .. of this node' were fixed in recent(ish) releases.

    And another couple of questions if that's ok. Firstly - how did you install RabbitMQ on each of these nodes? It's possible one or more installs is corrupted somehow - have you made any modifications to the installs? What does the config look like for each of the nodes?

    On 19 Mar 2013, at 03:41, Carl Hörberg wrote:
    Node1 floods the logs for a while at a rate of 20-100/sec:
    =ERROR REPORT==== 18-Mar-2013::07:10:40 ===
    Discarding message {'$gen_call',{<0.17965.1>,#Ref<0.0.1.90282>},stat} from <0.17965.1> to <0.5037.1> in an old incarnation (1) of this node (2)

    Start up node 3
    Floods
    =ERROR REPORT==== 18-Mar-2013::08:23:15 ===
    Discarding message {'$gen_call',{<0.7609.0>,#Ref<0.0.1.142489>},stat} from <0.7609.0> to <0.25515.26> in an old incarnation (1) of this node (3)
    and is stuck at
    "starting exchange, queue and binding recovery ..."

    This 'old incarnation of ...' stuff indicates that we have a process id for a queue that is no longer valid. In theory, the only way (I can see) for this to happen is if a queue master restarts faster than any of the slaves can detect its death (we have an outstanding bug to look at that, but it may not be relevant since recent releases have included several HA bug fixes) - but regardless, that kind of problem ought to present far earlier than the 'stat' request that's failing...

    Start up node 2, starts fast, says "Broker started" in startup_log, but doesn't list the plugins, "service rabbitmq-server start" never returns and rabbitmqctl status never returns

    That sounds suspicious - are you sure the enabled-plugins file and configuration for that node are intact?

    node 2 then runs out of memory again, without client connections this time:
    =INFO REPORT==== 18-Mar-2013::09:09:35 ===
    vm_memory_high_watermark set. Memory used:7336394640 allowed:7031336140
    =WARNING REPORT==== 18-Mar-2013::09:09:35 ===
    memory resource limit alarm set on node rabbit@tiger02

    Is this happening whilst node 1 is still stuck? How long does it take (roughly) to reach this state?

    Querying /api/overview at node1 gives:
    {error,{error,{badmatch,false},
    [{rabbit_mgmt_wm_overview,version,1},
    {rabbit_mgmt_wm_overview,to_json,2},
    {webmachine_resource,resource_call,3},
    {webmachine_resource,do,3},
    {webmachine_decision_core,resource_call,1},
    {webmachine_decision_core,decision,1},
    {webmachine_decision_core,handle_request,2},
    {rabbit_webmachine,'-makeloop/1-fun-0-',2}]}}

    What version of Erlang are you running? Upgrading to a recent version of Erlang would be a good idea due to bug fixes and the fact that line numbers in exception stack traces would make it easier to identify where things are going wrong.


    For that matter, what OS/Platform are you running on? How did you install Erlang?

    node 3 starts eventually.
    kills node 2, starts again, stops at "starting database ?"

    What do you mean 'kills node 2' exactly? A node will never kill another node. Do you mean that 'you' killed node 2? If so, how did you do this?

    nothing in the log or startup_err, cpu usage 0%
    kills after 30min and starts again, same thing.

    Again, what do you mean 'kills after 30min and starts again' - is this something you're doing? How are you 'killing' these nodes?

    node 3 can now output rabbitmqctl status, node 1 still cannot.
    node 1 can't be shutdown, force kills

    Right - so at this point you've done something like `kill -9` right?

    with node1 down, node 2 now comes pass "starting database" and starts
    neither node 2 or node 3 responds to rabbitmqctl status

    For how long do they not respond? I wonder if it could be that all these 'kill' signals you're issuing have left the mnesia database in an inconsistent state somehow.

    shutting down node 2, but doesn't respond, have to do kill -9

    'shutting down node 2' how - are you issuing `sudo rabbitmqctl stop` to do that?

    node 3 still doesn't respond to rabbitmqctl status
    shutdowns node 3, doesnt respond, killing it instead, now all nodes are down.

    The same approach right?

    note: When rabbitmqctl status doesnt work other stuff like list_users, cluster_status etc. works.

    Sounds like a process is stuck somewhere - the status call attempts to list all running erlang applications on the node, with the timeout set to 'infinity'. If an application has got stuck during startup (or shutdown!) that can be one of the symptoms. Again, please tell us which version of rabbit you're running. We've fixed bugs in (relatively) recent releases that presented as supervision trees getting stuck during shutdown/restart, which might (possibly) explain some of this.
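
    [Editor's note: a crude mitigation for the hanging call - a sketch, not a RabbitMQ feature. It simply wraps rabbitmqctl in coreutils timeout(1) and assumes rabbitmqctl is on the PATH:]

```shell
# Wrap the potentially hanging call in coreutils timeout(1) so the shell
# regains control after 10 seconds even if the node never answers.
# On a machine without rabbitmqctl installed, the command fails immediately
# and the same fallback branch is taken.
if timeout 10 rabbitmqctl status > /dev/null 2>&1; then
    echo "status ok"
else
    echo "status failed or timed out"
fi
```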

    Starting up node3, log now gets flooded with:
    =ERROR REPORT==== 18-Mar-2013::11:09:04 ===
    ** Generic server <0.629.0> terminating
    ** Last message in was {init,<0.182.0>}
    ** When Server state == {q,{amqqueue,
    [snip]
    ** Reason for termination ==
    ** {'module could not be loaded',
    [{undefined,init,
    [snip]


    This error has occurred because the backing queue module for the queue process is set to 'undefined' - have you made any configuration changes, such as setting the name of the backing queue module by any chance?


    Please let us know the answers to these queries and we'll try to figure out what's going on.


    Cheers,
    Tim
  • Carlhoerberg at Mar 23, 2013 at 3:01 am
    RabbitMQ 3.0.4, Erlang R14B04. Ubuntu 12.04, installed with apt-get install
    from your ppa, nothing custom at all.


    Nothing exciting in the config:
    [
    {rabbit, [
    {log_levels, [{connection, error}]},
    {vm_memory_high_watermark, 0.8},
    {tcp_listeners, [{"0.0.0.0", 5672}]},
    {ssl_listeners, [{"0.0.0.0", 5671}]},
    {ssl_options, [{cacertfile,"/etc/rabbitmq/ca.pem"},
    {certfile,"/etc/rabbitmq/key.pem"}
    ]}
    ]},
    {rabbitmq_management,
    [{listener, [{port, 15672},
    {ip, "0.0.0.0"},
    {ssl, true}
    ]}
    ]}
    ].


    Only the mgmt plugin enabled.


    I had no problems with file corruption as far as I know.


    Yes, I was "killing" the nodes, with kill -9


    No, haven't touched the backing queue or anything like that.






    --
    View this message in context: http://rabbitmq.1065348.n5.nabble.com/Bring-cluster-up-after-node-crash-tp25530p25676.html
    Sent from the RabbitMQ mailing list archive at Nabble.com.
  • Tim Watson at Mar 23, 2013 at 12:37 pm
    Hi

    On 23 Mar 2013, at 03:01, carlhoerberg wrote:
    RabbitMQ 3.0.4, Erlang R14B04. Ubuntu 12.04, installed with apt-get install
    from your ppa, nothing custom at all.

    I would strongly advise you to upgrade to the latest Erlang if possible. Many very important bug fixes have been incorporated since R14B.

    I had no problems with file corruption as far as i know. [snip]
    Yes, I was "killing" the nodes, with kill -9

    You shouldn't (have to) be doing that, obviously. It's possible (though somewhat unlikely) that a brutal kill might've left your file system in an inconsistent state.

    No, haven't touched the backing queue or anything like that.

    Is there anything else in the logs you can give us to go on?


  • Carlhoerberg at Mar 25, 2013 at 4:55 am
    What's the recommended way of using a recent Erlang version and RabbitMQ on
    Ubuntu?
    erlang-solutions and your Ubuntu package don't seem to play well together;
    found this "hack" though, https://gist.github.com/RJ/2284940


    We had to "kill" them, as rabbitmqctl stop didn't respond at all...






    --
    View this message in context: http://rabbitmq.1065348.n5.nabble.com/Bring-cluster-up-after-node-crash-tp25530p25682.html
    Sent from the RabbitMQ mailing list archive at Nabble.com.
  • Jean Paul Galea at Mar 25, 2013 at 7:35 am
    I ran into the same problem when installing Erlang from erlang-solutions.


    The problem is the dependency "erlang-nox", which is declared in the
    RabbitMQ package. erlang-nox is a meta-package that tries to install
    Erlang packages from the Ubuntu repository, conflicting with the ones
    from erlang-solutions.


    I also found the hack that you are linking to, but I think it can be
    done more elegantly.


    I wrote this simple bash script; it fetches the rabbitmq-server package
    using apt-get, drops the erlang-nox dependency, and installs the result.


    One thing that may bother you is that apt-get will always flag
    rabbitmq-server as "upgradeable", since the installed package differs
    from the one in the repo.


    However, if you do something like `apt-get update && apt-get upgrade &&
    apt-get dist-upgrade` it will __not__ automatically re-install the package.


    Also note that the script does not actually remove the "erlang-nox"
    dependency, rather it simply replaces the whole line with "adduser,
    logrotate", hence if the RabbitMQ team declares a new dependency, this
    script would need to be updated.


    ------------------------


    cat > /tmp/rabbitmq-install.sh << "EOF"
    #!/bin/bash


    TMPDIR1=`mktemp -d` || exit 99
    TMPDIR2=`mktemp -d` || exit 99
    trap 'rm -rf "$TMPDIR1" "$TMPDIR2"' 0 1 2 3 13 15


    cd $TMPDIR1


    /usr/bin/apt-get download rabbitmq-server || exit 1


    PACKAGE=`ls -1`


    /usr/bin/dpkg-deb --extract $PACKAGE $TMPDIR2
    /usr/bin/dpkg-deb --control $PACKAGE ${TMPDIR2}/DEBIAN
    sed --in-place 's/^Depends:.*$/Depends: adduser, logrotate/' ${TMPDIR2}/DEBIAN/control
    /usr/bin/dpkg --build $TMPDIR2 ${PACKAGE}.modified
    /usr/bin/dpkg --install ${PACKAGE}.modified
    EOF


    /bin/bash /tmp/rabbitmq-install.sh


    rm /tmp/rabbitmq-install.sh
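
    [Editor's note: to see what the sed rewrite does before pointing the script at a real package, here is a self-contained dry run against a sample control file. The Depends value below is invented for the demonstration:]

```shell
# Create a sample control file resembling rabbitmq-server's
# (this Depends line is made up for the demonstration).
cat > /tmp/control.sample <<'EOF'
Package: rabbitmq-server
Depends: erlang-nox (>= 1:12.b.3), adduser, logrotate
EOF

# The same rewrite the install script performs: replace the whole Depends line.
sed --in-place 's/^Depends:.*$/Depends: adduser, logrotate/' /tmp/control.sample

grep '^Depends:' /tmp/control.sample
# prints: Depends: adduser, logrotate
```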



    On 03/25/2013 05:55 AM, carlhoerberg wrote:
    What's the recommended way of using a recent erlang version and rabbitmq in
    ubuntu?
    erlang-solutions and your ubuntu package doesn't seem to play well together,
    found this "hack" though, https://gist.github.com/RJ/2284940

    We had to "kill" them, as rabbitmqctl stop didn't respond at all..



