Hi,

I'm load testing my RabbitMQ setup and I'm having problems I can't get
my head around. After publishing a lot of messages (millions) it seems
like the cluster is put into a bad state. The performance completely
rots and all connections are marked as blocked -- and even when all
queues have eventually been drained (after I've shut down the
producers) and all connections are closed, at least one of the nodes
is still showing memory usage well above the high watermark. This has
happened multiple times during my testing, and seems completely
reproducible.

Please have a look at this screenshot:

https://skitch.com/iconara/fynbd/rabbitmq-management

(The screenshot shows the web-based management console: zero messages
queued, but very high memory usage for one of the cluster nodes.)

At the time I took the screenshot, the Connections tab showed no
connections. It had been several minutes (perhaps ten) since the last
connection was closed.

If I start a producer, or a consumer, at this point the connection is
immediately marked as blocked in the Connections tab, and the message
rate numbers on the Overview tab show zero, even though my code is
reporting that it's sending thousands of messages (the number of
queued/ready messages does increase, though).

Removing all queues seems to resolve the problem, but that is not a
feasible workaround. It feels like I should be able to run the cluster
continuously without having to stop and clean up from time to time.

More specifics on the setup: the cluster consists of 4 EC2 instances
with 8 CPUs and 7 GB of RAM each (I forget the exact instance name)
running RabbitMQ 2.4. The producers and consumers are Ruby processes
running the latest RC of the AMQP gem. Each node has three queues
bound to one exchange, each with a single routing key; the producers
connect to a random cluster node and publish "hello world" with a
random routing key, so that each message will end up in exactly one
queue. The consumers connect to one of the cluster nodes (one consumer
per cluster node, in this test setup), and subscribe to all of the
queues on that node. The consumers do nothing besides ack the
messages. The idea behind the setup is to get load balancing and high
throughput.
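
To make the topology concrete, here is a minimal sketch of the same
shape in Java (illustrative only -- the real clients are the Ruby
processes described above, and the host, exchange, queue and routing
key names below are placeholders):

import com.rabbitmq.client.AMQP;
import com.rabbitmq.client.Channel;
import com.rabbitmq.client.Connection;
import com.rabbitmq.client.ConnectionFactory;
import com.rabbitmq.client.DefaultConsumer;
import com.rabbitmq.client.Envelope;

import java.io.IOException;
import java.util.Random;

public class TopologySketch {
    // placeholder node names; the real cluster has four EC2 instances
    static final String[] NODES = { "node1", "node2", "node3", "node4" };
    static final String EXCHANGE = "load.test"; // placeholder exchange name

    public static void main(String[] args) throws Exception {
        Random random = new Random();

        // producers and consumers connect to a random cluster node
        ConnectionFactory factory = new ConnectionFactory();
        factory.setHost(NODES[random.nextInt(NODES.length)]);
        Connection connection = factory.newConnection();
        Channel channel = connection.createChannel();

        // one direct exchange; three queues, each bound with its own routing key
        channel.exchangeDeclare(EXCHANGE, "direct");
        for (int i = 0; i < 3; i++) {
            String queue = "queue." + i; // placeholder queue name
            channel.queueDeclare(queue, false, false, false, null);
            channel.queueBind(queue, EXCHANGE, "key." + i);
        }

        // a consumer that does nothing besides ack, as in the test
        channel.basicConsume("queue.0", false, new DefaultConsumer(channel) {
            @Override
            public void handleDelivery(String consumerTag, Envelope envelope,
                                       AMQP.BasicProperties properties, byte[] body)
                    throws IOException {
                getChannel().basicAck(envelope.getDeliveryTag(), false);
            }
        });

        // publisher: a random routing key per message, so each message
        // ends up in exactly one queue
        byte[] body = "hello world".getBytes();
        while (true) {
            channel.basicPublish(EXCHANGE, "key." + random.nextInt(3), null, body);
        }
    }
}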

Before the cluster rots, we publish, deliver and ack 15-20K messages
per second.

Thanks in advance for any tips,
Theo


  • Matthew Sackman at May 18, 2011 at 4:01 pm
    Hi Theo,
    On Wed, May 18, 2011 at 08:22:13AM -0700, Theo wrote:
    > More specifics on the setup: the cluster consists of 4 EC2 instances
    > with 8 CPUs and 7 GB of RAM each (I forget the exact instance name)
    > running RabbitMQ 2.4.
    Is that 2.4.0 or 2.4.1? If not 2.4.1, please upgrade. Which version of
    Erlang are you running? Is there anything of note other than connections
    and memory alarms in any of the logs?

    Matthew
  • Theo at May 18, 2011 at 4:15 pm
    2.4.1, should have been more specific. The OS is Ubuntu 10.04, btw.

    There are a lot of warnings in the log about memory usage, but it's
    more or less the same thing over and over again:

    =INFO REPORT==== 18-May-2011::14:37:20 ===
    vm_memory_high_watermark set. Memory used:3327376104 allowed:3006570496

    =INFO REPORT==== 18-May-2011::14:37:20 ===
    alarm_handler: {set,{{vm_memory_high_watermark,rabbit@rfmmqstaging01},[]}}

    then I found this, not sure if it's important (pasted it into a gist):

    https://gist.github.com/978897

    this is the version of erlang:

    Erlang R13B03 (erts-5.7.4) [source] [64-bit] [smp:8:8] [rq:8] [async-threads:0] [hipe] [kernel-poll:false]

    I've been watching the management UI since I posted my first message
    (almost an hour ago) and the memory usage hasn't gone down. Can a node
    go bad? The other nodes look ok, but the one that's in the red in the
    screenshot is still in the red. No connections to the server for an
    hour.

    T#
  • Theo at May 18, 2011 at 8:47 pm
    Eventually (a couple of hours without any connections) the node
    returned to normal memory usage, so I launched a new batch of tests. I
    saw a sustained load of 20K published/delivered/acked for a few
    minutes, with the memory usage of each node increasing rapidly. After
    perhaps five minutes the cluster rotted completely. No messages
    published, nothing delivered, all connections are blocked.

    20K messages per second over four nodes shouldn't be a problem,
    should it? From my point of view it looks like RabbitMQ is leaking
    memory badly, but that can't be, can it?

    At this point there are no messages on any queue, according to the
    web UI, but each node is still using over 3 GB of RAM.

    Theo
  • Theo at May 19, 2011 at 5:35 am
    I rewrote my test code in Java, just to make sure it isn't the Ruby
    driver that is causing the problem; here is the code:
    https://gist.github.com/980238

    I ran that with more or less the same result. After a few minutes at
    20K publishes/deliveries/acks per second the cluster rotted and all
    connections got blocked. What's more, this time I saw that if I turned
    off my producers the web UI still reported messages being published
    for several minutes -- no producers were running (I'm absolutely sure
    the processes were terminated) but there were still messages being
    published.

    Theo
  • Matthew Sackman at May 19, 2011 at 7:41 am
    Hi Theo,

    R13B03 is a little old, but I think it should still be fine.
    Hmm, curious. As a first change, could you try removing all plugins and
    trying again please?

    Matthew
  • Theo at May 19, 2011 at 8:08 am
    We only have the plugins required by the management UI installed;
    should I remove these? I don't know how to monitor the performance of
    the cluster without the management UI. I know of rabbitmqctl, but is
    there a way to see publishes/deliveries/acks per second in the same
    way as you can in the UI?

    These are the files in the plugins directory; from what I can tell
    they correspond exactly to the requirements of the management UI:

    amqp_client-2.4.1.ez
    mochiweb-2.4.1.ez
    rabbitmq-management-2.4.1.ez
    rabbitmq-management-agent-2.4.1.ez
    rabbitmq-mochiweb-2.4.1.ez
    webmachine-2.4.1.ez

    Theo

  • Matthias Radestock at May 19, 2011 at 9:28 am
    Theo,
    On 19/05/11 06:35, Theo wrote:
    > I rewrote my test code in Java, just to make sure it isn't the Ruby
    > driver that is causing the problem; here is the code:
    > https://gist.github.com/980238
    >
    > I ran that with more or less the same result. After a few minutes at
    > 20K publishes/deliveries/acks per second the cluster rotted and all
    > connections got blocked. What's more, this time I saw that if I turned
    > off my producers the web UI still reported messages being published
    > for several minutes -- no producers were running (I'm absolutely sure
    > the processes were terminated) but there were still messages being
    > published.
    The figure you see reported in the management UI is the rate at which
    rabbit processes these messages. The publisher code publishes messages
    as fast as it can, which is a far higher rate than rabbit can process
    them. Hence rabbit is buffering the messages in order to try to keep up.
    That way rabbit can handle brief periods of high publishing rate.

    However, if publishing continues at a high rate then eventually the
    buffers fill up all memory. At that point rabbit blocks further
    publishes until the buffers have been drained sufficiently to allow
    publishing to resume. This can take a while since rabbit has to page the
    messages to disk in order to free up space.

    Moreover, as soon as the publishers are unblocked, if they resume
    publishing at the high rate, as is the case in your test, the memory
    limit will be hit again quickly. Hence you will see rabbit "bouncing off
    the limiter", where publishers get unblocked briefly and then blocked again.


    This is all quite normal.


    Regards,

    Matthias.
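
    One way to keep publishers from outrunning the broker in a test like
    this is to bound the number of unconfirmed messages using publisher
    confirms, so the producer slows down to whatever rate the broker can
    sustain. A minimal, illustrative Java sketch (the window size, host,
    exchange and routing key are placeholders, it assumes a client
    library with confirm support, and it is not code from this thread):

    import com.rabbitmq.client.Channel;
    import com.rabbitmq.client.ConfirmListener;
    import com.rabbitmq.client.Connection;
    import com.rabbitmq.client.ConnectionFactory;

    import java.util.SortedSet;
    import java.util.TreeSet;
    import java.util.concurrent.Semaphore;

    public class ConfirmWindowPublisher {
        public static void main(String[] args) throws Exception {
            final int window = 1000; // max unconfirmed messages in flight (placeholder value)

            ConnectionFactory factory = new ConnectionFactory();
            factory.setHost("localhost"); // placeholder broker address
            Connection connection = factory.newConnection();
            final Channel channel = connection.createChannel();
            channel.confirmSelect(); // enable publisher confirms on this channel

            final Semaphore permits = new Semaphore(window);
            final SortedSet<Long> unconfirmed = new TreeSet<Long>();

            channel.addConfirmListener(new ConfirmListener() {
                private void release(long seqNo, boolean multiple) {
                    synchronized (unconfirmed) {
                        int released;
                        if (multiple) {
                            // the broker confirmed everything up to and including seqNo
                            SortedSet<Long> done = unconfirmed.headSet(seqNo + 1);
                            released = done.size();
                            done.clear();
                        } else {
                            released = unconfirmed.remove(seqNo) ? 1 : 0;
                        }
                        permits.release(released);
                    }
                }
                public void handleAck(long seqNo, boolean multiple) { release(seqNo, multiple); }
                public void handleNack(long seqNo, boolean multiple) { release(seqNo, multiple); }
            });

            byte[] body = "hello world".getBytes();
            while (true) {
                permits.acquire(); // blocks once `window` messages await confirmation
                synchronized (unconfirmed) {
                    unconfirmed.add(channel.getNextPublishSeqNo());
                    channel.basicPublish("load.test", "key.0", null, body); // placeholder names
                }
            }
        }
    }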
  • Theo at May 19, 2011 at 10:00 am
    Ok, thanks for the clarification; it makes sense, and it's more or
    less what we figured. I was hoping there was another explanation, one
    that would be fixable.

    What kind of performance should we expect when running on EC2? It
    feels like 20K messages per second shouldn't be a problem for a
    four-node cluster where each node has 8 CPUs and 7 GB of RAM. To get
    any kind of performance and stability we have had to tune how we work
    with the cluster, making sure that queues are distributed over the
    nodes, that each node has more than one queue, that the connections
    from the producers and consumers are evenly distributed over the
    nodes, etc. Even slight asymmetries show up quite quickly in the
    memory usage of a node, and as soon as that happens it's only minutes
    before the whole cluster goes bad. Topic exchanges seem to be
    completely out of the question; it only takes minutes before every
    node gets overloaded.

    With a direct exchange and extreme attention to symmetry we have
    achieved a sustained publish/deliver/ack load of 40K/s for as long as
    we ran the test, and we can probably live with that. The idea was to
    use a topic exchange, but at this point that seems completely out of
    the question; we can probably move that functionality into the
    application (just like we had to move load balancing into the
    application because each queue is bound to a single CPU).

    Theo
  • Matthias Radestock at May 19, 2011 at 10:19 am
    Theo,
    On 19/05/11 11:00, Theo wrote:
    > Ok, thanks for the clarification; it makes sense, and it's more or
    > less what we figured. I was hoping there was another explanation,
    > one that would be fixable.
    >
    > What kind of performance should we expect when running on EC2? It
    > feels like 20K messages per second shouldn't be a problem for a
    > four-node cluster where each node has 8 CPUs and 7 GB of RAM.
    20kHz shouldn't be a problem even for just a single 4-core machine,
    assuming message sizes are fairly small.
    > To get any kind of performance and stability we have had to tune how
    > we work with the cluster, making sure that queues are distributed
    > over the nodes, that each node has more than one queue, that the
    > connections from the producers and consumers are evenly distributed
    > over the nodes, etc. Even slight asymmetries show up quite quickly
    > in the memory usage of a node, and as soon as that happens it's only
    > minutes before the whole cluster goes bad.
    You shouldn't have to do that kind of tuning to get 20kHz. As I said
    in my reply to your previous email, your publishers publish at a rate
    far higher than 20kHz. And they do that all the time, not just for
    some brief peak period. That's why rabbit eventually fills up all
    memory with buffered messages and has to pause producers. If you run
    tests with throttled producers, e.g. get the producers to publish
    messages at a certain rate, then rabbit will happily keep up with
    that rate forever.
    > Topic exchanges seem to be completely out of the question; it only
    > takes minutes before every node gets overloaded.
    Again, that is due to the fact that the publishers operate at a far
    higher rate than is sustainable.

    Regards,

    Matthias.
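
    To illustrate the kind of throttling Matthias suggests, here is a
    minimal Java sketch that paces a publisher at a fixed target rate
    (the rate, host, exchange and routing key are placeholders, not
    values from the thread):

    import com.rabbitmq.client.Channel;
    import com.rabbitmq.client.Connection;
    import com.rabbitmq.client.ConnectionFactory;

    public class ThrottledPublisher {
        public static void main(String[] args) throws Exception {
            final int messagesPerSecond = 5000;     // target publish rate (placeholder value)
            final long intervalNanos = 1000000000L / messagesPerSecond;

            ConnectionFactory factory = new ConnectionFactory();
            factory.setHost("localhost");           // placeholder broker address
            Connection connection = factory.newConnection();
            Channel channel = connection.createChannel();

            byte[] body = "hello world".getBytes();
            long next = System.nanoTime();
            while (true) {
                long sleepMillis = (next - System.nanoTime()) / 1000000L;
                if (sleepMillis > 0) {
                    Thread.sleep(sleepMillis);      // wait until the next publish slot is due
                }
                channel.basicPublish("load.test", "key.0", null, body); // placeholder names
                next += intervalNanos;
            }
        }
    }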
