Hi,


I am relatively new to Rabbitmq and would appreciate help in
troubleshooting a recurring issue on a cluster, apologies for the long
email.


I run a 3-instance cluster in EC2: 2 disk nodes (A and B) and 1 RAM node (C).
Exchanges and queues are static and limited in number (fewer than 50). The
volume of messages can reach a few thousand per second, and the queues can
occasionally grow to a few hundred thousand messages until the consumers
manage to catch up, but this is well within our memory/disk high watermarks.
RabbitMQ is v2.8.4 on Ubuntu 12.


The processes that produce and consume messages are written in Java/Scala
and use the akka-amqp client library (akka-amqp-1.3.1).


In normal operation our nodes run at ~500MB of memory and the limit is set
to 2.7GB, but on a couple of occasions one of the nodes managed to very
rapidly (< 1 min) eat > 1GB of memory and trigger the high memory watermark
(node B, one of the disk nodes, last time).
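
When the alarm fires, a coarse breakdown can be pulled from the node's
Erlang shell (or via rabbitmqctl eval, if this version supports it) with
something like the sketch below; erlang:memory/0 is plain OTP, and the
vm_memory_monitor call is an assumption on my part for 2.8.4:

%% Overall memory breakdown of the node: total, processes, ets, binary, ...
erlang:memory().

%% Configured absolute limit in bytes (assuming vm_memory_monitor
%% exports get_memory_limit/0 in this version).
vm_memory_monitor:get_memory_limit().

This at least tells me whether the sudden growth sits under 'processes',
binaries or ETS.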


At the time of the issue most of the memory was held in 'processes' and no
queue was overgrown (in messages or memory). I managed to run a rabbitmqctl
report, found a crash report in the logs (more on this later) and tried to
see from the Erlang shell which process or processes were responsible for
the memory usage, but I wasn't able to pin down a specific one (my Erlang is
limited and unfortunately I didn't dump the process list to a file to look
at later).
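
For the record, what I was fumbling towards in the shell was roughly the
following (plain OTP calls only; just a sketch, and the output path is
merely an example):

%% Top 10 Erlang processes by memory, keeping the registered name and
%% initial call of each so they can be recognised later.
Top = lists:sublist(
        lists:reverse(
          lists:keysort(2,
            [{P, M, erlang:process_info(P, [registered_name, initial_call])}
             || P <- erlang:processes(),
                {memory, M} <- [erlang:process_info(P, memory)]])),
        10),

%% Dump the list to a file so it survives a node restart.
file:write_file("/tmp/rabbit_top_procs.txt", io_lib:format("~p~n", [Top])).

Since memory held in a process's mailbox is counted against that process,
process_info(P, message_queue_len) on the top entries would presumably be
the obvious next check.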


For every issue we have had (3 in 2 months), the logs show a crash report
relating to a specific Java process (say process D) sending a message to a
specific queue (queue D - this is a queue with no consumers at the moment;
if the app fails to process a message it ends up here). Note that there are
multiple D workers running on separate instances in EC2 and that queue D is
very low traffic, so it is unlikely to be a coincidence.


At this point I was pressured to restart RabbitMQ on the affected node B;
this was a stop_app, then a stop of the Erlang VM, then a restart. All
looked good: the node was in a nice place and everything looked green in the
management console.


Our Java processes would connect and do some work, but after a while they
would throw a connection error (sorry, this is all the detail the logs
show). Specifically, a process E connected to node A showed this random
behaviour (a different process and a different RabbitMQ node from the two
involved in the original issue).
After a while I was forced to restart the whole cluster. This was done by
stopping the app on the RAM node C, then on the disk node B, then on the
other node A. Unfortunately A seemed to hang and failed to stop within a
reasonable time (minutes), and I had to kill it (kill -15). I didn't see any
indication in the logs of why it was hanging.


When I attempted to start A it failed because the port was already in use,
and I found that a process still had an open connection to port 5672:


tcp6 0 164945 10.xx.xx.xx:5672 10.xx.xx.xx:53712 FIN_WAIT1


Note: this is a worker D, but running on a different box from the one shown
in the crash log.


Once this connection was cleared everything came up nice and happy.


In Summary:


Process D was publishing to queue D on node B; something happened and node
B managed to consume > 1GB of memory in less than a minute. Note that this
queue is really low traffic, but in every issue I have looked into I can see
a crash report similar to the one below.


After the restart of node B, other processes connected to node A started
exhibiting connection issues: they would work for a bit, then break the
connection, and this continued until the whole cluster was restarted.


I would like to better understand the crash report, and perhaps get some
ideas on what went wrong and how to troubleshoot such issues more
effectively (what more info should I collect before restarting the nodes:
the Erlang process list, mnesia tables, ets tables, etc.?).
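
Concretely, would a snapshot along these lines, taken in the node's Erlang
shell just before a restart, be a sensible starting point? Again only a
sketch, and the file path is just an example:

%% One-shot snapshot of state that is lost on restart: memory breakdown,
%% process count, ETS usage and the mnesia table list.
Snapshot = [{memory,        erlang:memory()},
            {process_count, erlang:system_info(process_count)},
            {ets_tables,    length(ets:all())},
            {mnesia_tables, mnesia:system_info(tables)}],
file:write_file("/tmp/rabbit_pre_restart.txt",
                io_lib:format("~p~n", [Snapshot])).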




=CRASH REPORT==== 14-Feb-2013::10:36:54 ===
crasher:
initial call: rabbit_reader:init/4
pid: <*0.29283.387*>
registered_name: []
exception error: bad argument
in function port_close/1
called as port_close(#Port<0.746540>)
in call from rabbit_net:maybe_fast_close/1
in call from rabbit_reader:start_connection/7
ancestors: [<0.29280.387>,rabbit_tcp_client_sup,rabbit_sup,<0.161.0>]
messages: []
links: [<0.29280.387>]
dictionary: [{{channel,10},
{<0.29364.387>,{method,rabbit_framing_amqp_0_9_1}}},
{{ch_pid,<0.29338.387>},{7,#Ref<0.0.2158.60093>}},
{{ch_pid,<0.29333.387>},{6,#Ref<0.0.2158.60085>}},
{{ch_pid,<0.29325.387>},{5,#Ref<0.0.2158.60053>}},
{{channel,3},
{<0.29313.387>,{method,rabbit_framing_amqp_0_9_1}}},
{{ch_pid,<0.29305.387>},{2,#Ref<0.0.2158.60002>}},
{{channel,4},
{<0.29321.387>,{method,rabbit_framing_amqp_0_9_1}}},
{{channel,11},
{<0.29370.387>,{method,rabbit_framing_amqp_0_9_1}}},
{{ch_pid,<0.29313.387>},{3,#Ref<0.0.2158.60017>}},
{{ch_pid,<0.29299.387>},{1,#Ref<0.0.2158.59976>}},
{{ch_pid,<0.29346.387>},{8,#Ref<0.0.2158.60112>}},
{{ch_pid,<0.29370.387>},{11,#Ref<0.0.2158.60189>}},
{{channel,7},
{<0.29338.387>,{method,rabbit_framing_amqp_0_9_1}}},
{{channel,9},
{<0.29356.387>,{method,rabbit_framing_amqp_0_9_1}}},
{{ch_pid,<0.29321.387>},{4,#Ref<0.0.2158.60034>}},
{{ch_pid,<0.29364.387>},{10,#Ref<0.0.2158.60166>}},
{{ch_pid,<0.29356.387>},{9,#Ref<0.0.2158.60140>}},
{{channel,8},
{<0.29346.387>,{method,rabbit_framing_amqp_0_9_1}}},
{{channel,5},
{<0.29325.387>,{method,rabbit_framing_amqp_0_9_1}}},
{{channel,1},
{<0.29299.387>,
{content_body,
{'basic.publish',0,<<"some_exchange">>,<<>>,false,
false},
1048189,
{content,60,none,
<<BYTES IN HERE>>,        --> this showed which process was sending the message
rabbit_framing_amqp_0_9_1,
[<<MORE BYTES IN HERE>>]  --> this I haven't been able to decode; it is fairly big - is it truncated?




And in the logs we can find the pid *0.29283.387* right before the crash:


=INFO REPORT==== 14-Feb-2013::10:31:46 ===
accepting AMQP connection <*0.29283.387*> (10.xx.xx.xx:58622 -> 10.xx.xx.xx:5672)

=INFO REPORT==== 14-Feb-2013::10:31:46 ===
accepting AMQP connection <0.29287.387> (10.xx.xx.xx:58623 -> 10.xx.xx.xx:5672)

=WARNING REPORT==== 14-Feb-2013::10:32:27 ===
closing AMQP connection <0.27107.387> (10.xx.xx.xx:50882 -> 10.xx.xx.xx:5672):
connection_closed_abruptly

=WARNING REPORT==== 14-Feb-2013::10:32:27 ===
closing AMQP connection <0.27111.387> (10.xx.xx.xx:50883 -> 10.xx.xx.xx:5672):
connection_closed_abruptly

=INFO REPORT==== 14-Feb-2013::10:33:35 ===
accepting AMQP connection <0.30447.387> (10.xx.xx.xx:50187 -> 10.xx.xx.xx:5672)

=INFO REPORT==== 14-Feb-2013::10:33:35 ===
accepting AMQP connection <0.30451.387> (10.xx.xx.xx:50188 -> 10.xx.xx.xx:5672)

=WARNING REPORT==== 14-Feb-2013::10:37:34 ===
closing AMQP connection <*0.29283.387*> (10.xx.xx.xx:58622 -> 10.xx.xx.xx:5672):
connection_closed_abruptly

=WARNING REPORT==== 14-Feb-2013::10:37:34 ===
closing AMQP connection <0.29287.387> (10.xx.xx.xx:58623 -> 10.xx.xx.xx:5672):
connection_closed_abruptly

=WARNING REPORT==== 14-Feb-2013::10:37:34 ===
closing AMQP connection <0.30451.387> (10.xx.xx.xx:50188 -> 10.xx.xx.xx:5672):
connection_closed_abruptly

=WARNING REPORT==== 14-Feb-2013::10:37:34 ===
closing AMQP connection <0.30447.387> (10.xx.xx.xx:50187 -> 10.xx.xx.xx:5672):
connection_closed_abruptly

=INFO REPORT==== 14-Feb-2013::10:37:34 ===
vm_memory_high_watermark set. Memory used:3384669192 allowed:2914218803


Looking at the rabbitmqctl report I have not been able to map the memory
consumption to anything specific yet.


Any help appreciated,


S

  • Jerry Kuch at Feb 22, 2013 at 8:48 pm
    Hi, Simone...


    I am relatively new to Rabbitmq and would appreciate help in
    troubleshooting a recurring issue on a cluster, apologies for the long
    email.

    No problem! Details are usually helpful...



    I run a 3-instance cluster in EC2: 2 disk nodes (A and B) and 1 RAM node
    (C). Exchanges and queues are static and limited in number (fewer than 50).
    The volume of messages can reach a few thousand per second, and the queues
    can occasionally grow to a few hundred thousand messages until the
    consumers manage to catch up, but this is well within our memory/disk high
    watermarks. RabbitMQ is v2.8.4 on Ubuntu 12.

    2.8.4 is now getting a bit old, and was from the middle of a sequence of
    bug fix releases, during which many things improved... you might want to
    consider upgrading. Before you do, see remarks below.



    I would like to better understand the crash report, and perhaps get some
    ideas on what went wrong and how to troubleshoot such issues more
    effectively (what more info should I collect before restarting the nodes:
    the Erlang process list, mnesia tables, ets tables, etc.?).

    =CRASH REPORT==== 14-Feb-2013::10:36:54 ===
    crasher:
    initial call: rabbit_reader:init/4
    pid: <*0.29283.387*>
    registered_name: []
    exception error: bad argument
    in function port_close/1
    called as port_close(#Port<0.746540>)
    in call from rabbit_net:maybe_fast_close/1
    in call from rabbit_reader:start_connection/7
    ancestors: [<0.29280.387>,rabbit_tcp_client_sup,rabbit_sup,<0.161.0>]
    messages: []
    links: [<0.29280.387>]
    dictionary: [{{channel,10},
    {<0.29364.387>,{method,rabbit_framing_amqp_0_9_1}}},
    {{ch_pid,<0.29338.387>},{7,#Ref<0.0.2158.60093>}},
    {{ch_pid,<0.29333.387>},{6,#Ref<0.0.2158.60085>}},
    {{ch_pid,<0.29325.387>},{5,#Ref<0.0.2158.60053>}},
    {{channel,3},
    {<0.29313.387>,{method,rabbit_framing_amqp_0_9_1}}},
    {{ch_pid,<0.29305.387>},{2,#Ref<0.0.2158.60002>}},
    {{channel,4},
    {<0.29321.387>,{method,rabbit_framing_amqp_0_9_1}}},
    {{channel,11},
    {<0.29370.387>,{method,rabbit_framing_amqp_0_9_1}}},
    {{ch_pid,<0.29313.387>},{3,#Ref<0.0.2158.60017>}},
    {{ch_pid,<0.29299.387>},{1,#Ref<0.0.2158.59976>}},
    {{ch_pid,<0.29346.387>},{8,#Ref<0.0.2158.60112>}},
    {{ch_pid,<0.29370.387>},{11,#Ref<0.0.2158.60189>}},
    {{channel,7},
    {<0.29338.387>,{method,rabbit_framing_amqp_0_9_1}}},
    {{channel,9},
    {<0.29356.387>,{method,rabbit_framing_amqp_0_9_1}}},
    {{ch_pid,<0.29321.387>},{4,#Ref<0.0.2158.60034>}},
    {{ch_pid,<0.29364.387>},{10,#Ref<0.0.2158.60166>}},
    {{ch_pid,<0.29356.387>},{9,#Ref<0.0.2158.60140>}},
    {{channel,8},
    {<0.29346.387>,{method,rabbit_framing_amqp_0_9_1}}},
    {{channel,5},
    {<0.29325.387>,{method,rabbit_framing_amqp_0_9_1}}},
    {{channel,1},
    {<0.29299.387>,
    {content_body,
    {'basic.publish',0,<<"some_exchange">>,<<>>,false,
    false},
    1048189,
    {content,60,none,
    <<BYTES IN HERE>>,        --> this showed which process was sending the message
    rabbit_framing_amqp_0_9_1,
    [<<MORE BYTES IN HERE>>]  --> this I haven't been able to decode; it is fairly big - is it truncated?

    Unfortunately, nothing springs to mind immediately to pursue from this...



    And in the logs we can find the pid *0.29283.387* right before the crash:

    =INFO REPORT==== 14-Feb-2013::10:31:46 ===
    accepting AMQP connection <*0.29283.387*> (10.xx.xx.xx:58622 -> 10.xx.xx.xx:5672)

    =INFO REPORT==== 14-Feb-2013::10:31:46 ===
    accepting AMQP connection <0.29287.387> (10.xx.xx.xx:58623 -> 10.xx.xx.xx:5672)

    =WARNING REPORT==== 14-Feb-2013::10:32:27 ===
    closing AMQP connection <0.27107.387> (10.xx.xx.xx:50882 -> 10.xx.xx.xx:5672):
    connection_closed_abruptly

    This could be pretty much anything, from client misbehavior to connection
    disruption...


    Looking at the rabbitmqctl report I have not been able to map the memory
    consumption to anything specific yet.

    I'd proceed with an upgrade to a more recent Rabbit... check here first:


    http://www.rabbitmq.com/blog/2012/11/19/breaking-things-with-rabbitmq-3-0/


    And if none of the changes in 3.0.x are going to cause you short-term
    inconvenience, then try going straight to the latest 3.0.2; if there are
    3.0 changes that you think will bother you or require changes to your apps
    or infrastructure, then jump to 2.8.6 for now... it was the seventh and
    last of the 2.8.x series and contains a pile of incremental fixes that may
    help with this (otherwise tricky to diagnose from what we have available
    right now) problem.


    Best regards,
    Jerry
  • Matthias Radestock at Feb 22, 2013 at 9:20 pm

    On 22/02/13 20:48, Jerry Kuch wrote:
    =CRASH REPORT==== 14-Feb-2013::10:36:54 ===

    This could be pretty much anything, from client misbehavior to
    connection disruption...

    Indeed. And the resulting error report is spurious; it's essentially
    complaining that it can't close a connection because that connection is
    already dead. This was fixed in 2.8.7.
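
    (For the curious, the underlying failure is just standard OTP behaviour;
    here is a minimal sketch, nothing RabbitMQ-specific: port_close/1 raises
    badarg once the port is already gone, which is what
    rabbit_net:maybe_fast_close/1 tripped over after the client socket had
    died.)

    Port = erlang:open_port({spawn, "cat"}, []),  %% any long-lived external program
    true = erlang:port_close(Port),               %% first close succeeds
    {'EXIT', {badarg, _}} = (catch erlang:port_close(Port)).  %% closing again: badarg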


    Which neatly brings me to the next point...

    And if none of the changes in 3.0.x are going to cause you short-term
    inconvenience, then try going straight to the latest 3.0.2; if there are
    3.0 changes that you think will bother you or require changes to your
    apps or infrastructure, then jump to 2.8.6 for now... it was the seventh
    and last of the 2.8.x series

    Erhm. 2.8.7 was the eighth and last. So go with that if 3.0.2 is not an
    option.


    Matthias.
  • Jerry Kuch at Feb 22, 2013 at 10:14 pm

    On Fri, Feb 22, 2013 at 1:20 PM, Matthias Radestock wrote:


    Which neatly brings me to the next point...

    And if none of the changes in 3.0.x are going to cause you short-term
    inconvenience, then try going straight to the latest 3.0.2; if there are
    3.0 changes that you think will bother you or require changes to your
    apps or infrastructure, then jump to 2.8.6 for now... it was the seventh
    and last of the 2.8.x series
    Erhm. 2.8.7 was the eighth and last. So go with that if 3.0.2 is not an
    option.

    Oops, yes, I stand corrected. 2.8.6 had lodged in the cracks of my mind
    because it was the base version used for the then-contemporary vFabric
    commercial Rabbit release.


    Jerry
