While testing a fail-over scenario with RabbitMQ 3.1.1, I have repeatedly encountered errors, sometimes resulting in durable queues vanishing.


The cluster consists of two brokers, with LVS / keepalived used to connect clients to a functional broker. There are 10 mirrored queues, each with ha-sync-mode = automatic. A script shuts down one broker or the other in turn using 'service rabbitmq-server {start|stop}', so that there is always one broker running, and leaves at least 30 seconds between each start / stop. I expect this test to run indefinitely without destabilising the cluster; however, I have not been able to achieve more than a few dozen fail-overs without some error occurring. I'm hoping someone may have insight or suggestions on how to stabilise this environment.
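
A minimal sketch of one such stop/start cycle, assuming ssh access to the brokers (the hostnames and transport are assumptions; RUN=echo, the default here, prints the commands instead of executing them):

```shell
RUN=${RUN:-echo ssh}   # default prints the commands; set RUN=ssh to run for real
GAP=${GAP:-30}         # leave at least 30 seconds between each start / stop

failover_cycle() {
  node="$1"
  $RUN "$node" service rabbitmq-server stop    # fail over to the surviving broker
  sleep "$GAP"
  $RUN "$node" service rabbitmq-server start   # restart; the mirror re-joins and syncs
  sleep "$GAP"
}

# Alternate between the two brokers indefinitely, e.g.:
#   while true; do failover_cycle zg-dev-mq-002; failover_cycle zg-dev-mq-003; done
```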


I have included basic environment details below and attached logs from both brokers showing one example. In this case zg-dev-mq-003 was stopped at 11:32:21 and went through what appears to be a clean shutdown:


=INFO REPORT==== 9-Jun-2013::11:33:22 === Halting Erlang VM


zg-dev-mq-002 detected that the other broker was down and promoted itself to master. Then, after accepting connections from clients, it logged the error shown below:


=INFO REPORT==== 9-Jun-2013::11:33:22 === rabbit on node 'rabbit@zg-dev-mq-003' down
=INFO REPORT==== 9-Jun-2013::11:33:22 === accepting AMQP connection <0.427.0> (10.0.72.36:61434 -> 172.17.0.73:5672)
=INFO REPORT==== 9-Jun-2013::11:33:22 === accepting AMQP connection <0.430.0> (10.0.72.36:61435 -> 172.17.0.73:5672)
=ERROR REPORT==== 9-Jun-2013::11:33:22 ===
** Generic server <0.418.0> terminating
** Last message in was {'$gen_cast',
                         {delete_and_terminate,
                          {badarg,
                           [{ets,insert_new,
                             [360523,
                              {{<<10,71,177,42,66,240,207,204,251,26,181,155,
                                  246,83,172,137>>,
                                <<120,196,170,245,109,158,126,84,92,250,21,193,
                                  123,113,128,48>>},
                               -1}],
                             []},
                            {rabbit_msg_store,client_update_flying,3,[]},
                            {rabbit_msg_store,'-remove/2-lc$^0/1-0-',2,[]},
                            {rabbit_msg_store,remove,2,[]},
                            {rabbit_variable_queue,
                             '-with_immutable_msg_store_state/3-fun-0-',2,[]},
                            {rabbit_variable_queue,with_msg_store_state,3,[]},
                            {rabbit_variable_queue,
                             with_immutable_msg_store_state,3,[]},
                            {rabbit_variable_queue,'-ack/2-lc$^0/1-0-',2,
                             []}]}}}
etc


Environment details (same for both brokers):


[root@zg-dev-mq-002]# uname -a
Linux zg-dev-mq-002.zettagrid.local 2.6.32-358.2.1.el6.x86_64 #1 SMP Wed Mar 13 00:26:49 UTC 2013 x86_64 x86_64 x86_64 GNU/Linux


[root@zg-dev-mq-002]# cat /etc/centos-release
CentOS release 6.4 (Final)


[root@zg-dev-mq-002]# yum list installed | egrep 'rabbit|erlang'
esl-erlang.x86_64 R16B-2 @/esl-erlang-R16B-2.x86_64
esl-erlang-compat.noarch R14B-1.el6 @/esl-erlang-compat-R14B-1.el6.noarch
rabbitmq-server.noarch 3.1.1-1 @/rabbitmq-server-3.1.1-1.noarch


Thanks very much,


Nathanael


________________________________


ZettaServe Disclaimer: This email and any files transmitted with it are confidential and intended solely for the use of the individual or entity to whom they are addressed. If you are not the named addressee you should not disseminate, distribute or copy this e-mail. Please notify the sender immediately if you have received this email by mistake and delete this email from your system. Computer viruses can be transmitted via email. The recipient should check this email and any attachments for the presence of viruses. ZettaServe Pty Ltd accepts no liability for any damage caused by any virus transmitted by this email.


-------------- next part --------------
A non-text attachment was scrubbed...
Name: logs.zip
Type: application/x-zip-compressed
Size: 331237 bytes
Desc: logs.zip
URL: <http://lists.rabbitmq.com/pipermail/rabbitmq-discuss/attachments/20130609/34895bfb/attachment.bin>


  • Simon MacMullen at Jun 10, 2013 at 10:04 am
    Hi. Looking at the logs it seems like the message store on mq-002
    crashed / shut down unexpectedly, but there's no information about this
    in the log. Do you have the corresponding sasl log?


    Cheers, Simon

    --
    Simon MacMullen
    RabbitMQ, Pivotal
  • Rensen, Nathanael at Jun 10, 2013 at 10:32 am
    I've attached the sasl log from mq-002. Sorry I didn't include that originally.


    Thanks for taking a look.


    Nathanael





    -------------- next part --------------
    A non-text attachment was scrubbed...
    Name: rabbit@zg-dev-mq-002-sasl.zip
    Type: application/x-zip-compressed
    Size: 7786 bytes
    Desc: rabbit@zg-dev-mq-002-sasl.zip
    URL: <http://lists.rabbitmq.com/pipermail/rabbitmq-discuss/attachments/20130610/6aae20b9/attachment.bin>
  • Simon MacMullen at Jun 10, 2013 at 2:19 pm
    Hi, thanks. This is definitely an odd-looking error; can you tell us more
    about what you're doing? Are you just starting / stopping nodes, or is
    there messaging activity going on at the same time (and if so, what)?


    Cheers, Simon


    --
    Simon MacMullen
    RabbitMQ, Pivotal
  • Rensen, Nathanael at Jun 10, 2013 at 3:26 pm
    Thanks for taking the time to look at those logs.


    The purpose of the test is to confirm that the client application, server application, LVS and RMQ itself all perform properly during fail-over. There is certainly messaging activity during the test, although relatively modest by RMQ standards judging from the rates I see discussed on this forum.


    Of the 10 queues, three were carrying a combined average of around 200 messages per second from client app to server app; the publish rate bursts to instantaneous peaks possibly exceeding 500 messages per second for short periods. A fourth queue returns responses from server app to client app at somewhere around 300 to 400 messages per second. A fifth queue carries application heartbeats from server to client at around 1 message every 2 seconds. The remaining five queues were idle during the test. Messages range from hundreds of bytes to a couple of kilobytes. Each message has a per-message TTL configured: 10 seconds for the application heartbeats, 90 seconds otherwise.
    When either the client or server application detects a channel / connection shutdown, it enters a reconnect cycle, making two connection attempts per second until it can re-establish the connection and resume operation. Each application uses a single RMQ connection for both consuming and publishing. All publishing happens on a single channel per app using publisher confirms.


    I can reliably reproduce an error of some kind by running the test for long enough (a couple of dozen fail-overs is typically enough, although sometimes fewer than 10). I haven't paid sufficiently close attention to say whether the error is always the same, but the symptoms are not always identical. Usually, after an error has occurred, one or other broker will refuse to shut down gracefully. Sometimes queues vanish and I have to reconfigure them.


    To gather a clean sample for the logs I posted, I deleted /var/lib/rabbitmq/mnesia and reconfigured the cluster and queues from scratch, then rebooted the nodes to ensure a clean starting point.
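
    For completeness, that reset amounts to something like the following sketch (destructive when run for real; the cluster re-join target is an assumption, and RUN=echo, the default here, only prints the commands):

```shell
RUN=${RUN:-echo}   # set RUN= (empty) to actually execute -- destroys all broker state

reset_node() {
  $RUN service rabbitmq-server stop
  $RUN rm -rf /var/lib/rabbitmq/mnesia        # drop the mnesia schema, queues and messages
  $RUN service rabbitmq-server start
}

rejoin_cluster() {                            # run on the second node only
  $RUN rabbitmqctl stop_app
  $RUN rabbitmqctl join_cluster rabbit@zg-dev-mq-002   # target node name assumed
  $RUN rabbitmqctl start_app
}
```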


    Thanks again,


    Nathanael




  • Simon MacMullen at Jun 13, 2013 at 12:37 pm
    Hi. We've failed to reproduce this using a similar workload, and have
    also spent some time staring at the appropriate bits of code without
    coming up with anything.


    So to try to narrow things down a bit:


    1) Does the problem still occur if you disable automatic eager sync?
    (And don't eager sync manually either.)


    2) Can you provide the Mnesia directory and logs from a machine which
    has just failed?
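
    For item 1, the change would look something like this (the policy name 'ha-all' and the catch-all pattern are assumptions, since the thread doesn't show the actual policy definition):

```shell
# Current behaviour: mirror on all nodes with automatic eager sync (RabbitMQ 3.1 syntax):
rabbitmqctl set_policy ha-all "^" '{"ha-mode":"all","ha-sync-mode":"automatic"}'

# Suggested experiment: same mirroring but no eager sync
# (omit ha-sync-mode, which then defaults to "manual"):
rabbitmqctl set_policy ha-all "^" '{"ha-mode":"all"}'
```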


    Cheers, Simon




    --
    Simon MacMullen
    RabbitMQ, Pivotal

Discussion overview

group: rabbitmq-discuss
categories: rabbitmq
posted: Jun 9, '13 at 5:03a
active: Jun 13, '13 at 12:37p
posts: 6
users: 2
website: rabbitmq.com
irc: #rabbitmq