I just noticed one of our devices had a fairly low amount of free disk space...


As it turns out, there was a 600MB crash dump (erl_crash.dump) from Rabbit.
I've deleted it for now, but in the future, what would be useful to do with
these? At 600MB it's a bit unwieldy to transfer over a cellular connection.


- alex


  • Alex Zepeda at Oct 2, 2012 at 2:09 pm

    On Sun, Sep 30, 2012 at 10:20:38AM +0100, Matthias Radestock wrote:


    The first handful of lines of the erl_crash.dump are the most
    interesting, so you could keep/transfer just that and discard the rest.

    Okay.
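    Something like this should do it, assuming the default dump location
    (erl_crash.dump is written to rabbit's working directory; the path and
    line count here are just examples):


    # keep just the head of the crash dump, then reclaim the disk space
    head -n 20 /var/lib/rabbitmq/erl_crash.dump > /tmp/erl_crash.head
    rm /var/lib/rabbitmq/erl_crash.dump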


    Looks like the first few lines are the most interesting:


    =erl_crash_dump:0.1
    Fri Sep 28 20:38:32 2012
    Slogan: eheap_alloc: Cannot allocate 747325720 bytes of memory (of type "heap").
    System version: Erlang R13B03 (erts-5.7.4) [source] [64-bit] [smp:4:4] [rq:4] [async-threads:30] [hipe] [kernel-poll:true]
    Compiled: Fri Sep 24 19:15:42 2010


    Nothing else should be using a lot of memory, so I guess this points to a non-Rabbit/Erlang problem, unless there have been memory leaks fixed since R13B03 + 2.8.6?


    I'm using the stock Ubuntu (11.04) packages for Erlang, and the debs from the Rabbit site.


    - alex
  • Matthias Radestock at Oct 3, 2012 at 9:24 pm
    Alex,

    On 02/10/12 15:09, Alex Zepeda wrote:
    =erl_crash_dump:0.1
    Fri Sep 28 20:38:32 2012
    Slogan: eheap_alloc: Cannot allocate 747325720 bytes of memory (of type "heap").
    System version: Erlang R13B03 (erts-5.7.4) [source] [64-bit] [smp:4:4] [rq:4] [async-threads:30] [hipe] [kernel-poll:true]
    Compiled: Fri Sep 24 19:15:42 2010

    Nothing else should be using a lot of memory,

    "should" - you may want to check that ;)


    Mind you, the above is trying to allocate just over 700MB - does that
    seem reasonable for your rabbit?

    unless there have been memory leaks fixed since R13B03 + 2.8.6?

    R13B03 is ancient, and countless bugs have been fixed since then,
    including memory leaks. Though, tbh, it's not the first place I'd look.
    Rabbit 2.8.7 plugs some leaks, but they were confined to HA queues (see
    the release notes for more details), so unless you are using those then
    upgrading won't help.


    Have you changed the memory watermark settings of rabbit at all? If not
    then rabbit really shouldn't run out of memory unless another program is
    stealing the memory, or your app is creating lots of exchanges, queues
    or bindings. Or lots of connections. Or you are using some exotic
    plug-ins with hitherto undiscovered bugs. In all those cases, grabbing
    the output of 'rabbitmqctl report' before rabbit dies should offer some
    clues.
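
    E.g. something along these lines (the output path is just an example):


    rabbitmqctl report > /tmp/rabbitmq-report-$(date +%Y%m%d-%H%M%S).txt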


    Regards,


    Matthias.
  • Alex Zepeda at Oct 7, 2012 at 5:18 am

    On 10/3/12 2:24 PM, Matthias Radestock wrote:


    R13B03 is ancient, and countless bugs have been fixed since then,
    including memory leaks. Though, tbh, it's not the first place I'd look.
    Rabbit 2.8.7 plugs some leaks, but they were confined to HA queues (see
    the release notes for more details), so unless you are using those then
    upgrading won't help.

    Have you changed the memory watermark settings of rabbit at all? If not
    then rabbit really shouldn't run out of memory unless another program is
    stealing the memory, or your app is creating lots of exchanges, queues
    or bindings. Or lots of connections. Or you are using some exotic
    plug-ins with hitherto undiscovered bugs. In all those cases, grabbing
    the output of 'rabbitmqctl report' before rabbit dies should offer some
    clues.

    The config changes are, I think, limited to shovel-related items,
    logging, disk space, and TCP bits. These machines have under 2GB of RAM,
    so allocating in excess of 700MB seems a bit odd.


    In an ideal world we'd see around 48,000 messages per day *at the very
    most*. In practice, we're running into problems where an order of
    magnitude more messages are being queued up under some circumstances...
    but I'd expect that rabbit should handle that gracefully and block the
    connection, no?


    - alex
  • Matthias Radestock at Oct 7, 2012 at 7:05 am
    Alex,

    On 07/10/12 06:18, Alex Zepeda wrote:
    These machines have under 2GB of RAM so allocating in excess of 700MB
    seems a bit odd.

    The total memory and high watermark are shown in the rabbit logs, e.g.
    something like


    =INFO REPORT==== 3-Oct-2012::20:31:08 ===
    Memory limit set to 4814MB of 12036MB total.


    So check that these figures make sense for your setup.
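
    For reference, the watermark is set via vm_memory_high_watermark in
    rabbitmq.config; a minimal sketch showing the default of 0.4 (40% of
    installed RAM):


    % rabbitmq.config -- 0.4 is the default; lower it on small devices
    [
      {rabbit, [{vm_memory_high_watermark, 0.4}]}
    ].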

    In an ideal world we'd see around 48,000 messages per day *at the
    very most*. In practice, we're running into problems where an order
    of magnitude more messages are being queued up under some
    circumstances... but I'd expect that rabbit should handle that
    gracefully and block the connection, no?

    When under memory pressure rabbit will page the messages to disk, and
    block producers to control the rate of message influx so it can keep up.
    Hence high message volumes should not cause rabbit to run out of memory.


    As suggested previously, when rabbit is using more memory than you
    expect the output of 'rabbitmqctl report' should shed some light on
    where it's going.


    Regards,


    Matthias.
  • Alex Zepeda at Oct 8, 2012 at 7:20 pm

    On Sun, Oct 07, 2012 at 08:05:22AM +0100, Matthias Radestock wrote:


    The total memory and high watermark are shown in the rabbit logs, e.g.
    something like

    =INFO REPORT==== 3-Oct-2012::20:31:08 ===
    Memory limit set to 4814MB of 12036MB total.

    So check that these figures make sense for your setup.

    =INFO REPORT==== 8-Oct-2012::10:00:43 ===
    Memory limit set to 395MB of 988MB total.

    As suggested previously, when rabbit is using more memory than you
    expect the output of 'rabbitmqctl report' should shed some light on
    where it's going.

    I'll try, but generally the devices having the most trouble have the
    most unreliable connections, so logging in is often very difficult.


    The other thing I'm seeing is that shovel appears to be getting
    stuck (but the shovel status shows it's running) on a number
    of these devices. What sort of diagnostics would be useful
    here?


    - alex
  • Matthias Radestock at Oct 8, 2012 at 7:44 pm
    Alex,

    On 08/10/12 20:20, Alex Zepeda wrote:
    =INFO REPORT==== 8-Oct-2012::10:00:43 ===
    Memory limit set to 395MB of 988MB total.

    Right, so rabbit attempting to allocate 700MB looks a bit suspicious, but
    could still be ok, i.e. if it wasn't using much at the time.


    Could a client be sending a very large message?

    As suggested previously, when rabbit is using more memory than you
    expect the output of 'rabbitmqctl report' should shed some light on
    where it's going.
    I'll try, but generally the devices having the most trouble have the
    most unreliable connections, so logging in is often very difficult.

    You may want to set up some automated monitoring/logging, so when a
    problem does arise you can look at the most recent reports in the post
    mortem.
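
    A minimal sketch of such a capture, assuming a cron-capable system
    (paths and interval are arbitrary; the hour-minute suffix gives a
    rolling 24h window of reports):


    # /etc/cron.d/rabbitmq-report -- snapshot broker state every 10 minutes
    */10 * * * * root rabbitmqctl report > /var/log/rabbitmq/report.$(date +\%H\%M).txt 2>&1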

    The other thing I'm seeing is that shovel appears to be getting
    stuck (but the shovel status shows it's running) on a number
    of these devices. What sort of diagnostics would be useful
    here?

    Have you got heartbeats enabled? If not then turn them on.


    Do the shovel connections show up at the destination?


    Btw, have you got a prefetch_count set in your shovel config, and are
    you running in an ack mode other than no_ack? If not then that might
    explain the unexpectedly high memory usage, since a stuck shovel
    connection would cause messages to pile up in memory in the shovel.
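
    In config terms that is one extra entry in the shovel definition, e.g.
    (the value is just an illustration):


    {prefetch_count, 100}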


    Regards,


    Matthias.
  • Alex Zepeda at Oct 8, 2012 at 8:27 pm

    On Mon, Oct 08, 2012 at 08:44:11PM +0100, Matthias Radestock wrote:


    Right, so rabbit attempting to allocate 700MB looks a bit suspicious, but
    could still be ok, i.e. if it wasn't using much at the time.

    Could a client be sending a very large message?

    I suppose anything's possible. Is there some high water mark for message
    size? A typical message should be around 400 bytes of text.

    Have you got heartbeats enabled? If not then turn them on.

    Yes.

    Do the shovel connections show up at the destination?

    Yes, with a five second timeout.

    Btw, have you got a prefetch_count set in your shovel config, and are
    you running in an ack mode other than no_ack? If not then that might
    explain the unexpectedly high memory usage, since a stuck shovel
    connection would cause messages to pile up in memory in the shovel.

    The shovel config looks like so:


    {rabbitmq_shovel,
     [{shovels,
       [{shovel_one,
         [{sources,      [{broker, "amqp://"}]},
          {destinations, [{broker, "amqp://user:password@remote.hostname?heartbeat=5"}]},
          {queue, <<"queue_name">>},
          {ack_mode, on_confirm},
          {publish_properties, [{delivery_mode, 2}]},
          {reconnect_delay, 35}
         ]}
       ]}
     ]}


    - alex
  • Matthias Radestock at Oct 8, 2012 at 9:06 pm
    Alex,

    On 08/10/12 21:27, Alex Zepeda wrote:
    Is there some high water mark for message size?

    Nope. Preventing out-of-memory errors due to ingestion of enormous
    messages is on our to-do list.

    Have you got heartbeats enabled? If not then turn them on.
    Yes.
    Do the shovel connections show up at the destination?
    Yes, with a five second timeout.

    So, just to be clear, you have connections that are "stuck" but do show
    up at the destination with a non-zero heartbeat/timeout? That's odd.

    The shovel config looks like so:

    {rabbitmq_shovel,
     [{shovels,
       [{shovel_one,
         [{sources,      [{broker, "amqp://"}]},
          {destinations, [{broker, "amqp://user:password@remote.hostname?heartbeat=5"}]},
          {queue, <<"queue_name">>},
          {ack_mode, on_confirm},
          {publish_properties, [{delivery_mode, 2}]},
          {reconnect_delay, 35}
         ]}
       ]}
     ]}

    You haven't got a prefetch_count set. This will cause all messages to pile
    up in memory in the event of a connection getting stuck.


    Regards,


    Matthias.
  • Alex Zepeda at Oct 8, 2012 at 9:55 pm

    On Mon, Oct 08, 2012 at 10:06:48PM +0100, Matthias Radestock wrote:


    So, just to be clear, you have connections that are "stuck" but do show
    up at the destination with a non-zero heartbeat/timeout? That's odd.

    The devices send messages to a local queue (shovel) and other messages
    directly to a remote rabbit server. Some of these devices are able to
    send out the remote messages, but the messages in the shovel queue
    aren't being sent. The last time I checked a device in this state,
    I ran the command: rabbitmqctl eval 'rabbit_shovel_status:status().'


    Which indicated to me that the shovel was active. I have not checked
    the remote rabbit instance (yet). All of the shovel configurations
    should be identical.


    If/when I succeed in logging into one of these stuck devices, are
    there any pertinent commands I should run?

    The shovel config looks like so:

    {rabbitmq_shovel,
     [{shovels,
       [{shovel_one,
         [{sources,      [{broker, "amqp://"}]},
          {destinations, [{broker, "amqp://user:password@remote.hostname?heartbeat=5"}]},
          {queue, <<"queue_name">>},
          {ack_mode, on_confirm},
          {publish_properties, [{delivery_mode, 2}]},
          {reconnect_delay, 35}
         ]}
       ]}
     ]}
    You haven't got a prefetch_count set. This will cause all messages to pile
    up in memory in the event of a connection getting stuck.

    Should I set prefetch_count? The queue in this case is durable, and
    messages should be delivered to the local queue with delivery_mode = 2.


    - alex
  • Matthias Radestock at Oct 8, 2012 at 10:02 pm
    Alex,

    On 08/10/12 22:55, Alex Zepeda wrote:
    I have not checked the remote rabbit instance (yet).

    Please do.

    If/when I succeed in logging into one of these stuck devices, are
    there any pertinent commands I should run?

    Not that I can think of. But please do check the logs for any abnormalities.

    Should I set prefetch_count?

    Sorry, my phrasing there was ambiguous. Yes, you should.

    The queue in this case is durable, and messages should be delivered
    to the local queue with delivery_mode = 2.

    W/o a prefetch_count the messages won't stay in the local queue; they
    will be sent to the shovel, where, if the outbound connection is stuck,
    they will sit in memory.
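
    Applied to the config above, that would look something like this (200
    is an arbitrary example value; tune it to your message sizes):


    {rabbitmq_shovel,
     [{shovels,
       [{shovel_one,
         [{sources,      [{broker, "amqp://"}]},
          {destinations, [{broker, "amqp://user:password@remote.hostname?heartbeat=5"}]},
          {queue, <<"queue_name">>},
          {prefetch_count, 200},  %% bounds unacked messages held in memory
          {ack_mode, on_confirm},
          {publish_properties, [{delivery_mode, 2}]},
          {reconnect_delay, 35}
         ]}
       ]}
     ]}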


    Regards,


    Matthias.
