I have a 3 machine cluster, 1 disc node and 2 ram nodes.


I stopped (soft, no crash or kill) the disc node.
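

For reference (a rough sketch rather than the exact commands I ran; rabbit@disc1 is a placeholder node name), checking cluster membership and doing a clean stop looks roughly like this on 2.8.x:

    # on any running node: shows which nodes are disc/ram and which are running
    rabbitmqctl cluster_status

    # on the disc node: clean shutdown of the broker and its Erlang VM
    rabbitmqctl stop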


One of the RAM nodes reports:


=ERROR REPORT==== 19-Nov-2012::08:29:32 ===
** Generic server <0.865.329> terminating
** Last message in was {'$gen_cast',
{event,
{event,channel_stats,
[{pid,<24148.17585.32>},
{transactional,false},
{confirm,false},
{consumer_count,1},
{messages_unacknowledged,0},
{messages_unconfirmed,0},
{messages_uncommitted,0},
{acks_uncommitted,0},
{prefetch_count,16},
{client_flow_blocked,false},
{channel_queue_stats,[{<6988.1853.0>,[{ack,7}]}]},
{channel_exchange_stats,
[{{resource,<<"vhost1">>,exchange,
<<"queue1">>},
[{publish,34649}]}]},
{channel_queue_exchange_stats,[]}],
{1353,313843,894086}}}}
** When Server state == {state,[{channel_exchange_stats,2244223060},
{channel_queue_exchange_stats,2244227157},
{channel_queue_stats,2244218963},
{channel_stats,2244210768},
{connection_stats,2244206619},
{consumers,2244214866},
{queue_stats,2244202517}],
5000}
** Reason for termination ==
** {badarith,[{rabbit_mgmt_db,rate,5},
{rabbit_mgmt_db,'-rates/5-lc$^0/1-0-',5},
{rabbit_mgmt_db,'-rates/5-lc$^1/1-1-',6},
{rabbit_mgmt_db,rates,5},
{rabbit_mgmt_db,handle_fine_stat,7},
{rabbit_mgmt_db,'-handle_fine_stats/4-lc$^1/1-1-',4},
{rabbit_mgmt_db,'-handle_event/2-lc$^1/1-0-',4},
{rabbit_mgmt_db,handle_event,2}]}
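

(badarith is the error Erlang raises when an arithmetic operator gets a non-number operand or divides by zero, so the management plugin's rate calculation evidently hit one of those cases. A purely illustrative way to see the same error class, unrelated to the actual rabbit_mgmt_db code:)

    # catching a division by zero prints {'EXIT',{badarith,...}}
    erl -noshell -eval 'io:format("~p~n", [(catch 1/0)]).' -s init stop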


The disc node was started (with a new IP), and it logs this:


=WARNING REPORT==== 19-Nov-2012::22:55:27 ===
msg_store_persistent: recovery terms differ from present
rebuilding indices from scratch


And it takes about 10-15min before it starts.


But the other nodes never recognized the returning node. Could that be because the DNS wasn't updated? It took about 1 min before the DNS resolved correctly, i.e. before the disc node's hostname resolved to the new IP, but I waited longer than that.
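

For reference, some checks that can separate a DNS problem from a clustering problem (a sketch; the hostname is a placeholder):

    # on a RAM node: has the new IP reached this host's resolver?
    getent hosts disc1.example.com

    # on the disc node: is the Erlang port mapper up and advertising the rabbit node?
    epmd -names

    # on a RAM node: is the disc node listed under running_nodes?
    rabbitmqctl cluster_status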


Meanwhile the disc node reported a lot of these messages:


=ERROR REPORT==== 19-Nov-2012::23:06:55 ===
Discarding message {'$gen_call',{<0.2291.0>,#Ref<0.0.5.101573>},{notify_down,<5145.1671.3>}} from <0.2291.0> to <0.1733.0> in an old incarnation (3) of this node (2)


So I restarted the RAM nodes too. Now all cluster nodes could communicate again, but the mgmt interface reported:


"Statistics database could not be contacted. Message rates and queue lengths will not be shown."


So I stopped all nodes (the RAM nodes first and the disc node last), then brought them back up again (disc node first), and now the cluster functioned normally.
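

The ordering, roughly (a sketch rather than the exact commands; node names and the service name depend on the installation):

    # stop the RAM nodes first, the disc node last
    ssh ram1 'rabbitmqctl stop'
    ssh ram2 'rabbitmqctl stop'
    ssh disc1 'rabbitmqctl stop'

    # bring the disc node back first, then the RAM nodes
    ssh disc1 'service rabbitmq-server start'
    ssh ram1 'service rabbitmq-server start'
    ssh ram2 'service rabbitmq-server start'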


Any idea what was going on?


RabbitMQ 2.8.7
Erlang R14B4


  • Tim Watson at Nov 23, 2012 at 5:09 pm
    Hi Carl,


    On 22 Nov 2012, at 05:01, Carl Hörberg wrote:

    I have a 3 machine cluster, 1 disc node and 2 ram nodes.

    I stopped (soft, no crash or kill) the disc node.

    It's best *not* to leave a cluster with no disk nodes, as that represents an inherent risk if the ram nodes die whilst the disk node is out of action (for maintenance or whatever). Better to have 2 disk nodes and 1 ram node if possible.
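
    (For what it's worth, on the 2.8.x series a ram node can be turned into a disc node by re-clustering it with its own name included in the node list; a rough sketch with placeholder node names, assuming the old-style `rabbitmqctl cluster` command:)

        # run on the ram node being converted
        rabbitmqctl stop_app
        rabbitmqctl cluster rabbit@disc1 rabbit@ram1   # listing its own name makes it a disc node
        rabbitmqctl start_app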

    One of the RAM nodes reports:

    =ERROR REPORT==== 19-Nov-2012::08:29:32 ===
    ** Generic server <0.865.329> terminating
    ** Last message in was {'$gen_cast',
    {event,
    {event,channel_stats,
    [{pid,<24148.17585.32>},
    {transactional,false},
    {confirm,false},
    {consumer_count,1},
    {messages_unacknowledged,0},
    {messages_unconfirmed,0},
    {messages_uncommitted,0},
    {acks_uncommitted,0},
    {prefetch_count,16},
    {client_flow_blocked,false},
    {channel_queue_stats,[{<6988.1853.0>,[{ack,7}]}]},
    {channel_exchange_stats,
    [{{resource,<<"vhost1">>,exchange,
    <<"queue1">>},
    [{publish,34649}]}]},
    {channel_queue_exchange_stats,[]}],
    {1353,313843,894086}}}}
    ** When Server state == {state,[{channel_exchange_stats,2244223060},
    {channel_queue_exchange_stats,2244227157},
    {channel_queue_stats,2244218963},
    {channel_stats,2244210768},
    {connection_stats,2244206619},
    {consumers,2244214866},
    {queue_stats,2244202517}],
    5000}
    ** Reason for termination ==
    ** {badarith,[{rabbit_mgmt_db,rate,5},
    {rabbit_mgmt_db,'-rates/5-lc$^0/1-0-',5},
    {rabbit_mgmt_db,'-rates/5-lc$^1/1-1-',6},
    {rabbit_mgmt_db,rates,5},
    {rabbit_mgmt_db,handle_fine_stat,7},
    {rabbit_mgmt_db,'-handle_fine_stats/4-lc$^1/1-1-',4},
    {rabbit_mgmt_db,'-handle_event/2-lc$^1/1-0-',4},
    {rabbit_mgmt_db,handle_event,2}]}

    This is a known bug I'm afraid. See http://lists.rabbitmq.com/pipermail/rabbitmq-discuss/2012-November/023538.html for another instance. We're working to resolve it at the moment and hope to get a fix out asap.

    The disc node was started (with a new IP), and it logs this:

    =WARNING REPORT==== 19-Nov-2012::22:55:27 ===
    msg_store_persistent: recovery terms differ from present
    rebuilding indices from scratch

    I would not expect a node stopped with `rabbitmqctl stop` to exhibit this behaviour; however, the fact that you've recycled the IP address could be a factor - though it shouldn't be. We'll spend some time looking into that.

    And it takes about 10-15min before it starts.

    It is not unusual for a message store containing a large volume of messages to take 'quite some time' to completely recover.

    But the other nodes never recognized the returning node. Could that be because the DNS wasn't updated? It took about 1 min before the DNS resolved correctly, i.e. before the disc node's hostname resolved to the new IP, but I waited longer than that.

    Meanwhile the disc node reported a lot of these messages:

    =ERROR REPORT==== 19-Nov-2012::23:06:55 ===
    Discarding message {'$gen_call',{<0.2291.0>,#Ref<0.0.5.101573>},{notify_down,<5145.1671.3>}} from <0.2291.0> to <0.1733.0> in an old incarnation (3) of this node (2)

    This could be indicative of a bug, as it indicates a channel close notification being sent to a queue record that should've been deleted (or replaced). We will investigate that.

    So I restarted the RAM nodes too. Now all cluster nodes could communicate again, but the mgmt interface reported:

    "Statistics database could not be contacted. Message rates and queue lengths will not be shown."

    So I stopped all nodes (the RAM nodes first and the disc node last), then brought them back up again (disc node first), and now the cluster functioned normally.

    Well I'm glad you got it under control eventually.

    Any idea what was going on?

    RabbitMQ 2.8.7
    Erlang R14B4

    Could you clarify one or two other details about your environment please? Firstly, do you have HA (mirrored) queues set up in this cluster? Secondly, was the disk node designated the 'master' when you shut it down? And just to confirm: you shut the disk node down by issuing `rabbitmqctl stop` or an equivalent (such as `service rabbitmq-server stop` or similar), right? If you could provide any of the node's rabbit and rabbit-sasl logs, that would be very useful.
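
    (If it helps with the master question, the location of each mirrored queue's master and slaves can be listed per vhost; a sketch, with the vhost name taken from the log above:)

        rabbitmqctl -p vhost1 list_queues name pid slave_pids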


    Thanks for reporting this - I will try to keep the list informed of any developments in tracking it down!


    Cheers,
    Tim
  • Matthias Radestock at Nov 23, 2012 at 9:39 pm

    On 23/11/12 17:09, Tim Watson wrote:
    It is not unusual for a message store containing a large volume of
    messages to take 'quite some time' to completely recover.

    Though only if the node wasn't shut down cleanly.


    Matthias.
  • Carlhoerberg at Dec 3, 2012 at 2:29 pm
    Ok, thank you!


    I shut the server down with "shutdown -h now", so upstart should've stopped
    rabbitmq gracefully.


    Yes, there are HA queues set up. I load balance client connections through
    round-robin DNS, so different nodes were masters for different queues.





