Hello,

I've been performance/stress testing RabbitMQ in our perf lab. We've
found that we can consistently crash RabbitMQ's Erlang node (running
on Solaris) when using 2 or more producers sending messages as fast as
they can, together with 2 or more consumers. The crashes occur during
memory allocation.

I found an article that talks about flow control in RabbitMQ 1.5.0
[1]. It doesn't seem to work in our 1.5.3 setup. When I start rabbit
from erl, I see that the memsup app is not started. I checked the
rabbitmq-server configuration and see "-os_mon start_memsup false".
I've tried setting start_memsup to true and I've also tried starting
memsup manually before calling rabbit:start(). Neither seems to cause
flow control information to be logged and both still result in the
erlang node crashing.
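
For reference, the two attempts looked roughly like this (node names and
the other server start arguments omitted; a sketch rather than an exact
transcript):

  Attempt 1 - change the flag in the server start arguments:
    erl ... -os_mon start_memsup true ...

  Attempt 2 - from the erl shell, before starting rabbit:
    application:set_env(os_mon, start_memsup, true).
    application:start(os_mon).
    rabbit:start().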

I very strongly suspect user error, but I'd appreciate some guidance
on how to enable this feature.


[1]: http://hopper.squarespace.com/blog/2008/11/9/flow-control-in-rabbitmq.html

Thanks,
Chris


  • Matthias Radestock at Mar 23, 2009 at 8:14 pm
    Chris,

    Chris Pettitt wrote:
    I found an article that talks about flow control in RabbitMQ 1.5.0
    [1]. It doesn't seem to work in our 1.5.3 setup. When I start rabbit
    from erl, I see that the memsup app is not started. I checked the
    rabbitmq-server configuration and see "-os_mon start_memsup false".
    I've tried setting start_memsup to true and I've also tried starting
    memsup manually before calling rabbit:start(). Neither seems to cause
    flow control information to be logged and both still result in the
    erlang node crashing.

    I very strongly suspect user error, but I'd appreciate some guidance
    on how to enable this feature.
    See http://www.rabbitmq.com/admin-guide.html#memsup - you are supposed
    to be using "-rabbit memory_alarms true". Enabling memsup the way you
    did should be OK too, though, as long as the server isn't low on memory
    to start with and you wait at least a minute before stressing it. You
    may need to tweak the threshold. Also, we don't know whether memsup on
    Solaris is producing the right information, which is why rabbit leaves
    it turned off by default on that platform. So if you can do some
    testing/investigation, that would be great.
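
    To spell that out, the memory alarm and threshold can be set with
    something along these lines in the arguments the server is started
    with (this assumes the threshold in play is os_mon's
    system_memory_high_watermark parameter - check the admin guide for
    the exact name):

      erl ... -rabbit memory_alarms true -os_mon system_memory_high_watermark 0.7 ...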


    Matthias.
  • Chris Pettitt at Mar 23, 2009 at 9:31 pm
    Hi Matthias,

    Thank you for the pointer. I should have looked more closely at the
    documentation!

    Cursory testing reveals that with a high watermark of 0.7, RabbitMQ
    uses flow control without crashing, while 0.75 or higher results in
    the Erlang node crashing (before memsup detects the low memory
    condition?). I'll follow up if I learn anything more about memsup's
    behavior on Solaris.

    Thanks,
    Chris
  • Chris Pettitt at Mar 31, 2009 at 10:36 pm
    Hi all,

    I found that the broker on Solaris continued to crash regardless of what
    settings I used for the high watermark. As memsup on Solaris was
    questionable, I moved to Linux.

    I moved to a quad-core CentOS box with 9 GB of memory. Unfortunately,
    I'm seeing similar crashes, though they take longer to occur.

    What I'm seeing is usually some variation of the following:

    1. Start broker clean
    2. Start 1 consumer
    3. Start 10 producers that publish as fast as they can (we're trying
    to stress the system)
    4. Once system memory reaches the high water mark, throttling occurs
    (I've tried settings from about 40% to 95%; the observations below
    refer to the 70% setting).
    5. Throttling toggles on and off a few times (3 to 5 times), and
    then all clients (including the consumer) get disconnected.
    6. Memory soars to over 90%.
    7. Sometimes the Erlang process crashes at this point; other times all
    producers and consumers reconnect and within about 30 seconds the
    Erlang process crashes. In most cases the producers never produce
    after the reconnect, but on some occasions the consumer does receive
    messages before dying.

    I understand that this is pretty excessive in terms of stress.
    However, without going into details, it is very important that I
    demonstrate that RabbitMQ degrades gracefully under high load.

    On a more positive note, I'm seeing RabbitMQ outperform my current JMS
    provider by a factor of ~15x on Solaris!

    Any help would be appreciated.

    Thanks,
    Chris

  • Matthias Radestock at Mar 31, 2009 at 11:22 pm
    Chris,

    Chris Pettitt wrote:
    What I'm seeing is usually some variation of the following:

    1. Start broker clean
    2. Start 1 consumer
    3. Start 10 producers that publish as fast as they can (we're trying
    to stress the system)
    4. Once system memory reaches the high water mark, throttling occurs
    (I've tried settings from about 40% to 95%; the observations below
    refer to the 70% setting).
    5. Throttling toggles on and off a few times (3 to 5 times), and
    then all clients (including the consumer) get disconnected.
    6. Memory soars to over 90%.
    7. Sometimes the Erlang process crashes at this point; other times all
    producers and consumers reconnect and within about 30 seconds the
    Erlang process crashes. In most cases the producers never produce
    after the reconnect, but on some occasions the consumer does receive
    messages before dying.
    I have tried to reproduce the above by running the
    com.rabbitmq.examples.MulticastMain test with the args "-a -x 10 -s
    1024", and while I can make RabbitMQ crash with the default 95% limit,
    once I lowered the limit to 70% it kept going.
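
    (For reference, with the Java client and its examples on the classpath
    that is an invocation along the lines of

      java -cp <rabbitmq java client and examples jars> com.rabbitmq.examples.MulticastMain -a -x 10 -s 1024

    where the exact classpath depends on how the client was built.)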

    That's on an old-ish machine with only 4GB of memory though. It's
    conceivable that a faster machine with more memory requires more of a
    margin. 40%, which is the lowest limit you tried, is quite low, but it
    may be worth trying even lower limits.

    Can you send us the code for your tests? I'd like to gain a better
    understanding of exactly what parts of RabbitMQ get stressed, and how
    your test differs from what MulticastMain with the above params does.


    Also, what version of Erlang are you running?


    Regards,

    Matthias.
  • Chris Pettitt at Apr 1, 2009 at 5:07 pm
    Matthias,

    I tried MulticastMain with almost the same settings (except we're
    using 2K messages) and a high water mark of 70%. With this
    configuration, RabbitMQ stayed up for much longer and I didn't see
    the Erlang node crash.

    The difference in my producer is that it uses persistence. I see the
    same crash (at a 70% high water mark) using these settings with
    MulticastMain: -a -x 10 -s 2048 -f persistent
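
    To be clear, by "persistence" I mean publishing with delivery_mode=2 to
    a durable queue - roughly the following, shown in Erlang-client notation
    purely for illustration (the queue name is made up and the client API
    details may differ from what our test code actually does):

      %% assumes the amqp_client header records are available
      {ok, Conn} = amqp_connection:start(#amqp_params_network{}),
      {ok, Ch} = amqp_connection:open_channel(Conn),
      #'queue.declare_ok'{} = amqp_channel:call(
          Ch, #'queue.declare'{queue = <<"stress">>, durable = true}),
      Msg = #amqp_msg{props   = #'P_basic'{delivery_mode = 2},  %% 2 = persistent
                      payload = <<0:2048/unit:8>>},             %% 2K body, as with -s 2048
      ok = amqp_channel:cast(
          Ch, #'basic.publish'{exchange = <<>>, routing_key = <<"stress">>}, Msg).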

    Is persistent messaging compatible/supported with producer flow control?

    Thanks for your help! If any specific observations, instrumentation,
    logging, etc would be helpful, please let me know.

    Thanks,
    Chris
  • Chris Pettitt at Apr 1, 2009 at 5:24 pm
    Some additional data that may be helpful:

    At a 40% high water mark with "-a -x 10 -s 2048 -f persistent" the
    broker hangs after a few toggles of the flow control. Memory usage
    sits at about 59% and the queue is empty.

    Normally I reconnect the clients to the same queue and the broker
    crashes. With MulticastMain I can't do that (it creates an exclusive
    queue), so I just run it again with the same options. At this point
    nothing happens, probably because flow control is still engaged and
    memory never comes below the 40% high water mark. However, it does not
    crash.

    - Chris
  • Matthias Radestock at Apr 1, 2009 at 6:03 pm
    Chris,

    Chris Pettitt wrote:
    At a 40% high water mark with "-a -x 10 -s 2048 -f persistent" the
    broker hangs after a few toggles of the flow control. Memory usage
    sits at about 59% and the queue is empty.
    That may well be due to another known problem with the persister: it can
    hang on to memory for too long. We fixed that a few weeks ago and the
    fix will be part of the next major release. Try out a recent snapshot in
    the meantime - build instructions are at
    http://www.rabbitmq.com/build-server.html


    Regards,

    Matthias.
  • Matthias Radestock at Apr 1, 2009 at 5:53 pm
    Chris,

    Chris Pettitt wrote:
    The difference in my producer is that it uses persistence.
    Ah, I thought that might be the case. There is a known performance
    limitation in the persister: in some scenarios RabbitMQ will end up
    spending most of its time writing new snapshots. That's because the time
    it takes to write a snapshot is proportional to the amount of persisted
    data, and snapshots are written every 500 entries (i.e. publishes,
    deliveries, acks) by default. Writing new snapshots is also very
    memory-intensive.

    Work is underway to fix this, but it's going to take some time.
    Is persistent messaging compatible/supported with producer flow control?
    Yes, but because of the stress caused by the persister, the limits have
    to be set very low.


    Matthias.
