http://www.rabbitmq.com/memory.html

I'm running into some problems that I'm hoping someone here can
help me with. Let me first describe my setup. I'm running the latest
release of RabbitMQ on CentOS (64-bit) in a clustered configuration
with 2 nodes in my Production environment. In my development
environment I only have 1 node. My code is written in C# and
uses the SDK downloaded from the website. The memory
configuration value is left at the default of 40%.

I ran into a problem in my Production environment a week ago where
work was building up on my queue faster than my consumers were
processing it. Unfortunately I wasn't able to perform any debugging or
metrics gathering before the system was recycled or the queue was
purged; I'm still not sure exactly which happened. What I can tell
you is that it looked like exceptions were occurring for both the
publishers and consumers and nothing was really happening. I have
since added more consumers and the problem has not occurred again,
but it obviously has me concerned.

I am currently trying to reproduce this problem in my development
environment, but I'm running into some confusing results. My test
case is to run multiple publishers sending messages over and over as
fast as possible while also having a single consumer process those
messages with a delay. The goal is obviously to force messages to
pile up in memory on the node and trigger the memory alert. Since I
have both publishers and consumers connected, I'm expecting the
publishers to start getting some sort of exception saying they can't
submit any more work while the consumer continues to process. But
what I'm actually seeing is that the publishers keep piling on and
the consumer keeps processing, until the machine eventually runs out
of disk space and crashes.
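
A minimal sketch of such a test harness, assuming the RabbitMQ .NET client API
of that era (ConnectionFactory, IModel, QueueingBasicConsumer); the host, queue
name, payload size and delay are all illustrative:

    using System.Text;
    using System.Threading;
    using RabbitMQ.Client;
    using RabbitMQ.Client.Events;

    class FloodTest
    {
        const string Queue = "load-test";   // illustrative queue name

        static void Publisher()
        {
            var factory = new ConnectionFactory { HostName = "localhost" };
            using (IConnection conn = factory.CreateConnection())
            using (IModel channel = conn.CreateModel())
            {
                channel.QueueDeclare(Queue, false, false, false, null);
                byte[] body = Encoding.UTF8.GetBytes(new string('x', 1024));
                while (true)
                    channel.BasicPublish("", Queue, null, body);   // publish as fast as possible
            }
        }

        static void Consumer()
        {
            var factory = new ConnectionFactory { HostName = "localhost" };
            using (IConnection conn = factory.CreateConnection())
            using (IModel channel = conn.CreateModel())
            {
                channel.QueueDeclare(Queue, false, false, false, null);
                var consumer = new QueueingBasicConsumer(channel);
                channel.BasicConsume(Queue, false, consumer);      // manual acks
                while (true)
                {
                    var ea = (BasicDeliverEventArgs)consumer.Queue.Dequeue();
                    Thread.Sleep(100);                             // deliberately slow consumer
                    channel.BasicAck(ea.DeliveryTag, false);
                }
            }
        }

        static void Main()
        {
            for (int i = 0; i < 4; i++)                            // several fast publishers
                new Thread(Publisher) { IsBackground = true }.Start();
            Consumer();                                            // one slow consumer
        }
    }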

Can anyone advise me on what I'm doing wrong, or help me figure out
what changes I can make?

Thanks
Andy


  • Steve Powell at Dec 22, 2011 at 6:31 pm
    Andy,

    By the 'latest release' I presume you mean 2.7.0 (a week ago that would have
    been true, but we have now released 2.7.1).

    Can you please show us your RabbitMQ log after a crash? The test-environment
    case would be interesting, though the production system is probably
    experiencing issues of an application nature.

    If the consumers were failing (getting exceptions) for some application
    reason, and they were responsible for acknowledging the messages, then it is
    entirely likely that the messages they failed to process are being re-queued,
    and the queue is building up without being drained. The application
    exceptions are therefore very interesting, and you should take care that a
    consumer acknowledges a message WHEN IT HAS BEEN DEALT WITH -- even if that
    means the error was logged or passed on, or whatever.
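
    A minimal sketch of that acknowledgement discipline, assuming the
    QueueingBasicConsumer API from the .NET client of that era and an
    already-open IModel channel; ProcessMessage is a hypothetical application
    handler:

        // Acknowledge once the message has been dealt with, even when "dealt
        // with" means the failure was only logged or passed on, so the broker
        // does not keep re-queuing the same failing message.
        static void ConsumeAndAlwaysAck(IModel channel, string queue)
        {
            var consumer = new QueueingBasicConsumer(channel);
            channel.BasicConsume(queue, false, consumer);   // noAck = false: manual acks
            while (true)
            {
                var ea = (BasicDeliverEventArgs)consumer.Queue.Dequeue();
                try
                {
                    ProcessMessage(ea.Body);                // hypothetical handler
                }
                catch (Exception ex)
                {
                    Console.Error.WriteLine("failed; logged and skipped: " + ex.Message);
                    // alternatively, publish the message to an error queue here
                }
                channel.BasicAck(ea.DeliveryTag, false);    // ack either way: it has been dealt with
            }
        }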

    In the latest release, message re-queuing preserves the order (for a single
    consumer), so a failing message may reappear at the head of the queue -- this
    causes it to be re-processed more or less straight away, and if the failure is
    a logical error in the message payload, it is likely to fail again, and so on.
    Previous releases did not try to preserve message order, so failing messages
    could be overtaken by non-failing ones, and this would not show up as a
    bottleneck under high load.

    I'm interested in your RAM configuration. It is entirely possible for RabbitMQ
    to run out of memory even when a threshold is set: continual high publication
    rates, especially with new publishers connecting all the time, will not be
    blocked entirely even then. This might mean that the test you ran is giving
    you misleading information.

    When the memory piles up you could also issue a rabbitmqctl report, which
    should tell us the general situation.

    Steve Powell (a perplexed bunny)
    ----------some more definitions from the SPD----------
    avoirdupois (phr.) 'Would you like peas with that?'
    distribute (v.) To denigrate an award ceremony.
    definite (phr.) 'It's hard of hearing, I think.'
    modest (n.) The most mod.
  • AndyB at Dec 22, 2011 at 7:35 pm
    Thanks for the quick reply guys. With the help of one of my systems
    guys, I think we have made some progress on this issue. After
    performing further tests with my application and doing some close
    monitoring on the server node itself, we noticed some unexpected
    messages being written to disk for the queue, which is transient. That
    immediately confused us, but after some creative googling we came
    across the following page, which helped us a great deal:
    http://lists.rabbitmq.com/pipermail/rabbitmq-discuss/2011-October/015793.html.
    Given that information and some tweaks to the tests, I was finally
    able to reproduce the memory throttling of publishers that I was
    expecting to see, and I think I now have a better understanding of the
    RAM usage and statistics and how they tie to the watermark value.

    But this gets me to my next concern: the throttling of the publishers
    appears to be a blocking operation inside the "BasicPublish" method
    and, as best I can tell, there is no timeout. This indefinite blocking
    would be pretty bad if it were to occur in my production environment.
    Is there a way I can specify some sort of timeout for the blocking
    operation? I see an overload of "BasicPublish" which has boolean
    parameters for "immediate" and "mandatory". Are either of those meant
    to help in this situation?

    Thanks
    Andy
  • Matthias Radestock at Dec 22, 2011 at 7:48 pm

    On 22/12/11 19:35, AndyB wrote:
    But this gets me to my next concern ... The throttling of the
    publishers appears to be a blocking operation from within the
    "BasicPublish" method and as best I can tell, I'm not seeing any
    sort of timeout. This indefinite blocking would be pretty bad if it
    were to occur in my production environment.
    It's not going to block indefinitely, since the paging to disk, or the
    consumption of messages, will free up space, at which point the
    producers are unblocked.

    Just think of this situation as being the same as a slow network /
    server; it's indistinguishable from that.
    Is there a way that I can specify some sort of timeout for the
    blocking operation?
    No.

    Matthias.
  • AndyB at Dec 22, 2011 at 7:53 pm
    In my test case, I have intentionally coded the consumer to never
    catch up. During the process, I saw the publishers get blocked after
    the alert, messages were streamed to disk enough to get below the
    watermark, and the publishers were unblocked. Of course they hit the
    watermark soon after and the same process happened again. I'd say
    this happened maybe 4 or 5 times and then they just remained in a
    blocked state. I let the test continue running for almost 10 minutes
    and they never became unblocked and the watermark alert never seemed
    to clear on the server. So I guess that means it stopped streaming
    messages to disk, or something? Either way, I'm going to have to
    implement something in my code to avoid tying up a thread for an
    unknown amount of time. Any ideas? Is there a way to subscribe to the
    watermark event or something?

    Andy
  • Matthias Radestock at Dec 22, 2011 at 8:21 pm
    Andy,
    On 22/12/11 19:53, AndyB wrote:
    In my test case, I have intentionally coded the consumer to never
    catch up. During the process, I saw the publishers get blocked
    after the alert, messages were streamed to disk enough to get below
    the watermark, and the publishers were unblocked. Of course they hit
    the watermark soon after and the same process happened again. I'd
    say this happened maybe 4 or 5 times and then they just remained in
    a blocked state. I let the test continue running for almost 10
    minutes and they never became unblocked and the watermark alert never
    seemed to clear on the server. So I guess that means that it
    stopped streaming the messages to disk or something?
    It's possible that you ran into another limit...

    Each message has a small memory footprint, even when it has been paged
    to disk. So there is an upper bound to how many messages rabbit can hold
    on to. When that bound is reached producers will remain blocked until
    some messages have been consumed.

    There is a way around that - changing the message store index module to
    one that operates entirely on disk. See
    https://github.com/rabbitmq/rabbitmq-toke. However, I don't know of any
    production rabbits that have actually run into this limitation.
    Either way, I'm going to have to come up with a way to implement
    something in my code to try to avoid tying up a thread for an unknown
    amount of time. Any ideas?
    You could perform all the invocations of the AMQP client's publish
    methods from a single, separate thread. It would sit in a loop, pulling
    messages off a bounded buffer / queue (e.g. an ArrayBlockingQueue if
    this was Java; there are presumably similar data structures in C#, and
    worst case you could roll your own) and invoking the publish methods in
    the AMQP client.

    The "real" publishing threads simply deposit messages into the buffer /
    queue using an operation with a timeout, e.g. BlockingQueue.offer(E o,
    long timeout, TimeUnit unit).
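
    A minimal C# sketch of that pattern, assuming .NET 4's
    System.Collections.Concurrent.BlockingCollection as the bounded buffer (the
    rough counterpart of Java's ArrayBlockingQueue); the capacity, timeout and
    queue name are illustrative:

        using System.Collections.Concurrent;
        using System.Text;
        using RabbitMQ.Client;

        class TimedPublisher
        {
            // Bounded buffer: holds at most 1000 pending messages (illustrative capacity).
            readonly BlockingCollection<byte[]> buffer = new BlockingCollection<byte[]>(1000);

            // Called by the "real" publishing threads. Returns false instead of
            // blocking forever when the buffer stays full (i.e. rabbit has
            // blocked the publishing connection).
            public bool TryPublish(string text, int timeoutMs)
            {
                return buffer.TryAdd(Encoding.UTF8.GetBytes(text), timeoutMs);
            }

            // A single dedicated thread runs this loop and owns all BasicPublish
            // calls; only this thread stalls when the memory alarm blocks the
            // connection.
            public void PublishLoop(IModel channel, string queue)
            {
                foreach (byte[] body in buffer.GetConsumingEnumerable())
                    channel.BasicPublish("", queue, null, body);
            }
        }

    The callers then treat a false return from TryPublish as a timeout and can
    report an error to their own clients instead of hanging.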


    Matthias.
  • Jason J. W. Williams at Dec 22, 2011 at 8:02 pm

    Is there a way that I can specify some sort of timeout for the
    blocking operation?

    No.
    I think a timeout would be a useful setting on the client side. If
    Rabbit backs up enough, you don't want your frontend just appearing to
    hang to the frontend's client. At some point it should raise a timeout
    exception you can handle and give feedback to the app's client.

    -J
  • Matthias Radestock at Dec 22, 2011 at 8:25 pm

    On 22/12/11 20:02, Jason J. W. Williams wrote:
    I think a timeout would be a useful setting on the client side. If
    Rabbit backs up enough, you don't want your frontend just appearing to
    hang to the frontend's client. At some point it should raise a timeout
    exception you can handle and give feedback to the app's client.
    There's a bug open to address that. Alas it's been open for >1 year and
    is a major piece of work across all our clients. So I doubt we will work
    on it any time soon.

    Matthias.
  • AndyB at Dec 22, 2011 at 8:31 pm
    This sounds brutal and exactly what I was afraid of.

    Andy
  • Matthias Radestock at Dec 22, 2011 at 7:30 pm

    On 22/12/11 16:12, AndyB wrote:
    My test case is to run multiple publishers sending messages
    over and over as fast as possible while also have a single consumer
    processing those messages with a delay. The goal is to obviously
    force messages to pile up in memory on the node to trigger the memory
    alert. Since I have both publishers and consumers connected, I'm
    expecting to see the consumers begin to get some sort of exception
    saying that they cant submit anymore work while the consumer continues
    to process. But what I'm actually seeing is that the publishers
    continue piling on and the consumer continues to process, but the
    machine eventually runs out of disk space and crashes.
    Rabbit does not enforce any disk space limits. So if producers keep
    publishing messages, and consumers do not keep up, rabbit will page more
    and more messages to disk and eventually disk space runs out.

    Limiting disk space usage, similar to the way rabbit limits ram usage,
    is on our todo list. Though in practice most systems have ample disk
    space and operational monitoring usually detects situations that fill up
    the disk with plenty of time to take remedial action.

    Matthias.
