I have a four-node 2.7.1 cluster. I just started experimenting with
mirrored queues. One queue is mirrored across nodes 1&2, a second queue
is mirrored across nodes 3&4. I've been feeding a lot of large messages
through, using delivery-mode 2. Most of the messages I've purged, since
the reader process can't keep up yet.
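
(For context on how the mirroring is set up: these are 2.7.x-style
x-ha-policy queues, declared via rabbitmqadmin; the queue and node names
below are placeholders, but the declaration looks roughly like

    rabbitmqadmin -H node1 declare queue name=big-messages durable=true \
      arguments='{"x-ha-policy": "nodes", "x-ha-nodes": ["rabbit@node1", "rabbit@node2"]}'

and every publish uses delivery-mode 2, so the messages are persistent.)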

Here's the problem: Memory usage. Nodes 1 & 3, presumably the master
nodes for the queues, have maintained a normal memory profile, 3-6GB.
But nodes 2 & 4, the presumable slaves, have had their memory grow to
58GB each. Worse, when I purged and then even deleted the queues, the
memory usage did not go down. It seems I may have to reboot these nodes
to get the memory back, and obviously I can't use mirrored queues if
they're going to make my nodes do this, which is disappointing. I do
have a workaround involving alternate exchanges, but the workaround can
leave data stranded if a node is lost forever.

Is there any other info I can provide to help diagnose and/or fix this?

  • Travis at Feb 22, 2012 at 9:58 pm

    On Wed, Feb 22, 2012 at 2:12 PM, Reverend Chip wrote:
    I have a four-node 2.7.1 cluster. I just started experimenting with
    mirrored queues. One queue is mirrored across nodes 1&2, a second queue
    is mirrored across nodes 3&4. I've been feeding a lot of large messages
    through, using delivery-mode 2. Most of the messages I've purged, since
    the reader process can't keep up yet.

    Here's the problem: Memory usage. Nodes 1 & 3, presumably the master
    nodes for the queues, have maintained a normal memory profile, 3-6GB.
    But nodes 2 & 4, the presumable slaves, have had their memory grow to
    58GB each. Worse, when I purged and then even deleted the queues, the
    memory usage did not go down. It seems I may have to reboot these nodes
    to get the memory back, and obviously I can't use mirrored queues if
    they're going to make my nodes do this, which is disappointing. I do
    have a workaround involving alternate exchanges, but the workaround can
    leave data stranded if a node is lost forever.

    Is there any other info I can provide to help diagnose and/or fix this?

    I think we're experiencing the same thing. Previously, we were seeing
    a memory leak in 2.6.1 which was patched[1] in subsequent releases.
    Since then, we've upgraded to 2.7.1 as well, and we're seeing slowly
    growing memory usage on our slaves that requires periodic restarts to
    keep it down.

    In our case, we're using only two nodes with mirrored queues.

    Travis

    [1] This patch (http://hg.rabbitmq.com/rabbitmq-server/rev/89315378597d)
    fixed our memory leak in 2.6.1; applying just the patch to 2.6.1 worked
    great. Whatever is causing the new leak in 2.7.1 has to be something
    different, though.


    --
    Travis Campbell
    travis at ghostar.org
  • Travis at Feb 22, 2012 at 11:24 pm

    On Wed, Feb 22, 2012 at 3:58 PM, Travis wrote:
    I think we're experiencing the same thing. Previously, we were seeing
    a memory leak in 2.6.1 which was patched[1] in subsequent releases.
    Since then, we've upgraded to 2.7.1 as well and we're seeing slowly
    growing memory usage on our slaves that requires us to restart the
    slaves periodically to keep the memory usage down.

    In our case, we're using only two nodes with mirrored queues.

    Also, just to show how bad the problem is, I've uploaded (to imgur) a
    graph of the memory utilization for the slave over the last few
    weeks.

    http://i.imgur.com/dBNO8.png

    We currently set the VM memory high-watermark to 10% (which works out
    to 2.4GB of RAM on this system).
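
    (That is just the stock watermark knob turned down, i.e. roughly the
    following in rabbitmq.config -- sketching the stanza from memory:

        [{rabbit, [{vm_memory_high_watermark, 0.1}]}].

    where the default would be 0.4.)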

    As you can see, we had flat memory utilization up until about week 5.
    There's a small dip where we turned things off for a bit to upgrade
    from 2.6.1 (with the previously mentioned patch) to 2.7.1. After
    that, memory has grown pretty steadily. The points after that where
    usage drops almost to zero are right after restarts of the
    rabbitmq-server on this node.

    Travis
    --
    Travis Campbell
    travis at ghostar.org
  • DawgTool at Feb 23, 2012 at 6:37 pm
    Which version of Erlang are you running?
    I had the same issue with R13/R14, but after upgrading to R15B, no issues.
    Not sure if it's related or if I just got lucky, but it might be worth a shot.
    -- DawgTool

    ps> I also recall making some config changes, but don't recall at the
    moment. I'll look at those later today.

    On 2/22/12 3:12 PM, Reverend Chip wrote:
    I have a four-node 2.7.1 cluster. I just started experimenting with
    mirrored queues. One queue is mirrored across nodes 1&2, a second queue
    is mirrored across nodes 3&4. I've been feeding a lot of large messages
    through, using delivery-mode 2. Most of the messages I've purged, since
    the reader process can't keep up yet.

    Here's the problem: Memory usage. Nodes 1 & 3, presumably the master
    nodes for the queues, have maintained a normal memory profile, 3-6GB.
    But nodes 2 & 4, the presumable slaves, have had their memory grow to
    58GB each. Worse, when I purged and then even deleted the queues, the
    memory usage did not go down. It seems I may have to reboot these nodes
    to get the memory back, and obviously I can't use mirrored queues if
    they're going to make my nodes do this, which is disappointing. I do
    have a workaround involving alternate exchanges, but the workaround can
    leave data stranded if a node is lost forever.

    Is there any other info I can provide to help diagnose and/or fix this?

  • Simon MacMullen at Feb 24, 2012 at 10:48 am

    On 23/02/12 18:37, DawgTool wrote:
    ps> I also recall making some config changes, but don't recall at the
    moment. I'll look at those later today.

    That would be marvellous. Needless to say we've not been able to
    replicate this here.

    Cheers, Simon

    --
    Simon MacMullen
    RabbitMQ, VMware
  • Chip Salzenberg at Feb 27, 2012 at 7:34 pm

    Simon MacMullen wrote:
    On 23/02/12 18:37, DawgTool wrote:
    ps> I also recall making some config changes, but don't recall at the
    moment. I'll look at those later today.
    That would be marvellous. Needless to say we've not been able to
    replicate this here.

    Our local Erlang expert is on vacation, but he rescued our cluster.
    He described a mailbox full of messages that some process was not
    consuming. When he killed the process and, IIRC, the "shell" that had
    spawned it (I am merely parroting what I recall, this may be the wrong
    term), the bloat was cured and memory usage returned to normal.

    Details that might matter in reproduction: Our troubled mirrored
    queue used mode "nodes" (not "all"), picking two nodes out of a
    cluster of four. When I created it, I used a modified rabbitmqadmin
    that connected to one of the two nodes, and specified that it and one
    other node should mirror the queue. All messages were written with
    delivery-mode 2. Messages are about 46K each. While some of the
    messages were consumed, most were purged. Queue deletion did not cure
    the leak.
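
    I wasn't driving, so I can't reconstruct exactly what he ran, but the
    general shape of that kind of surgery, from an Erlang shell attached to
    the affected node, would be something like this (node name, cookie, and
    mailbox-size threshold are all placeholders):

        %% erl -sname dbg -remsh rabbit@node2 -setcookie <cookie>
        %% list processes whose mailboxes have grown suspiciously large
        [{P, Len, erlang:process_info(P, current_function)}
         || P <- erlang:processes(),
            {message_queue_len, Len} <-
                [erlang:process_info(P, message_queue_len)],
            Len > 100000].
        %% then inspect the suspect with erlang:process_info(Pid)
        %% before resorting to exit(Pid, kill).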
  • Chip Salzenberg at Feb 27, 2012 at 7:50 pm

    On Feb 27, 11:34 am, Chip Salzenberg wrote:
    Simon MacMullen wrote:
    On 23/02/12 18:37, DawgTool wrote:
    ps> I also recall making some config changes, but don't recall at the
    moment. I'll look at those later today.
    That would be marvellous. Needless to say we've not been able to
    replicate this here.

    An additional note: Most messages were written by clients that did not
    require acks.

    Our local Erlang expert is on vacation, but he rescued our cluster.
    He described a mailbox full of messages that some process was not
    consuming. When he killed the process and, IIRC, the "shell" that had
    spawned it (I am merely parroting what I recall, this may be the wrong
    term), the bloat was cured and memory usage returned to normal.

    Details that might matter in reproduction: Our troubled mirrored
    queue used mode "nodes" (not "all"), picking two nodes out of a
    cluster of four. When I created it, I used a modified rabbitmqadmin
    that connected to one of the two nodes, and specified that it and one
    other node should mirror the queue. All messages were written with
    delivery-mode 2. Messages are about 46K each. While some of the
    messages were consumed, most were purged. Queue deletion did not cure
    the leak.
  • Simon MacMullen at Feb 28, 2012 at 9:50 am
    Thanks for the additional information. I'll have another go at
    replicating...
    On 27/02/12 19:34, Chip Salzenberg wrote:

    Our local Erlang expert is on vacation, but he rescued our cluster.
    He described a mailbox full of messages that some process was not
    consuming. When he killed the process and, IIRC, the "shell" that had
    spawned it (I am merely parroting what I recall, this may be the wrong
    term), the bloat was cured and memory usage returned to normal.

    I don't suppose you have any record of what type of process it was /
    what the messages looked like?

    Cheers, Simon

    Details that might matter in reproduction: Our troubled mirrored
    queue used mode "nodes" (not "all"), picking two nodes out of a
    cluster of four. When I created it, I used a modified rabbitmqadmin
    that connected to one of the two nodes, and specified that it and one
    other node should mirror the queue. All messages were written with
    delivery-mode 2. Messages are about 46K each. While some of the
    messages were consumed, most were purged. Queue deletion did not cure
    the leak.

    --
    Simon MacMullen
    RabbitMQ, VMware
  • Simon MacMullen at Mar 5, 2012 at 4:45 pm
    Hi Max, thanks for writing. Are you the "local Erlang expert" Chip
    referred to?
    On 24/02/12 15:19, Max Kalika wrote:
    3) There was one process with over 2 million pages. It was constantly
    running lists:zipwith/3. Info on it showed lots and lots of messages
    in the mailbox that just weren't decreasing.

    I don't suppose you have any record of what these messages were? Or any
    logs from around this time?

    I mentioned earlier that the system is *mostly* recovered. The
    remaining problem is disk utilization. I suspect that since our
    messages are marked durable, disk cleanup didn't occur. I'm not sure
    how to sync this up with runtime reality without restarting.

    So the only code in the rabbit codebase which invokes lists:zipwith/3 is
    the file_handle_cache. Hmm. Unfortunately it uses it in a rather generic
    location, so this is still not exactly clear.

    This would help explain why disk cleanup didn't occur, since you killed
    something that was in the middle of doing some file handling.

    Hmm.

    Cheers, Simon

    --
    Simon MacMullen
    RabbitMQ, VMware
  • Chip Salzenberg at Mar 9, 2012 at 6:42 pm
    Indeed he is. We're cycling through the sick cluster now, wiping
    nodes in turn, to recover from this problem (and also increase disk
    sizes).
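
    For each node being wiped, this is basically the standard
    stop/reset/re-cluster dance -- node names are placeholders and the
    command form is the 2.x-era rabbitmqctl cluster, from memory:

        rabbitmqctl -n rabbit@node2 stop_app
        rabbitmqctl -n rabbit@node2 reset
        rabbitmqctl -n rabbit@node2 cluster rabbit@node1 rabbit@node2
        rabbitmqctl -n rabbit@node2 start_app

    with the disk resize done while the app is stopped.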

    On Mar 5, 8:45 am, Simon MacMullen wrote:
    Hi Max, thanks for writing. Are you the "local Erlang expert" Chip
    referred to?
    On 24/02/12 15:19, Max Kalika wrote:

    3) There was one process with over 2 million pages. It was constantly
    running lists:zipwith/3. Info on it showed lots and lots of messages
    in the mailbox that just weren't decreasing.

    I don't suppose you have any record of what these messages were? Or any
    logs from around this time?

    I mentioned earlier that the system is *mostly* recovered. The
    remaining problem is disk utilization. I suspect that since our
    messages are marked durable, disk cleanup didn't occur. I'm not sure
    how to sync this up with runtime reality without restarting.

    So the only code in the rabbit codebase which invokes lists:zipwith/3 is
    the file_handle_cache. Hmm. Unfortunately it uses it in a rather generic
    location, so this is still not exactly clear.

    This would help explain why disk cleanup didn't occur, since you killed
    something that was in the middle of doing some file handling.

    Hmm.

    Cheers, Simon

    --
    Simon MacMullen
    RabbitMQ, VMware
  • Emile Joubert at Mar 12, 2012 at 5:08 pm

    On 24/02/12 15:19, Max Kalika wrote:
    This turned out to be worse than we first realized. When two of the
    servers exhibited high memory usage, clients became blocked and
    stopped receiving data. I was able to unwedge the running process
    with some Erlang surgery. Obviously, this was a high-risk operation

    My attempt to reproduce this issue was unsuccessful. It would be
    great if you could provide some more information about the process
    consuming the most memory, e.g. output from process_info/1. Does it
    respond to sys:get_status()?
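
    Concretely, if anyone catches this again, that would mean capturing
    roughly the following from a shell on the affected node, with Pid bound
    to the bloated process, before killing anything:

        erlang:process_info(Pid).
        erlang:process_info(Pid, messages).  %% the queued messages themselves (may be huge)
        sys:get_status(Pid).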


    -Emile

Discussion Overview
group: rabbitmq-discuss
categories: rabbitmq
posted: Feb 22, 2012 at 8:12 pm
active: Mar 12, 2012 at 5:08 pm
posts: 11
users: 5
website: rabbitmq.com
irc: #rabbitmq
