Hello,

I'm looking for ways to debug a rabbitmq server that has been running
since 2010-02-01. Sometime in the last 2 days it stopped delivering
messages. I've seen this happen before, and restarting fixes it, but
I'd like to get to the bottom of why it stops delivering messages, so
I'd prefer to run tests while it is still in this state.

We're using the python amqplib client. The various versions are:

$ uname -srvmpio
Linux 2.6.32-37-generic-pae #81-Ubuntu SMP Fri Dec 2 22:24:22 UTC 2011
i686 unknown unknown GNU/Linux

$ sudo rabbitmqctl report
Reporting server status on {{2012,2,23},{2,18,37}}

Status of node 'rabbit at rabbitmq1' ...
[{pid,11649},
{running_applications,[{rabbit,"RabbitMQ","2.7.1"},
{os_mon,"CPO CXC 138 46","2.2.4"},
{sasl,"SASL CXC 138 11","2.1.8"},
{mnesia,"MNESIA CXC 138 12","4.4.12"},
{stdlib,"ERTS CXC 138 10","1.16.4"},
{kernel,"ERTS CXC 138 10","2.13.4"}]},
{os,{unix,linux}},
{erlang_version,"Erlang R13B03 (erts-5.7.4) [source] [rq:1]
[async-threads:30] [hipe] [kernel-poll:true]\n"},
{memory,[{total,107054896},
{processes,72060236},
{processes_used,72055268},
{system,34994660},
{atom,736077},
{atom_used,727538},
{binary,15125896},
{code,6063325},
{ets,10643172}]},
{vm_memory_high_watermark,0.3999999992195367},
{vm_memory_limit,102503219}]
...


Python 2.6.5
python-amqplib 0.6.1-1


I've used this python test program:
#----------------------------------------------
"""
To setup for this test you must say (on rabbitmq1):
sudo rabbitmqctl add_user scheduler scheduler
sudo rabbitmqctl add_vhost /test
sudo rabbitmqctl set_permissions -p /test scheduler ".*" ".*" ".*"
"""
from amqplib import client_0_8 as amqp

conn = amqp.Connection(host="rabbitmq1:5672", userid="scheduler",
password="scheduler", virtual_host="/test", insistúlse)
chan = conn.channel()
chan.queue_declare("alistair_test_q", durable=True, auto_deleteúlse)
chan.exchange_declare("alistair_test_ex", type="direct", durable=True,
auto_deleteúlse)
chan.queue_bind("alistair_test_q", "alistair_test_ex", routing_key="alistair")

msg = amqp.Message("test", delivery_mode=2)
print "tx_select"
chan.tx_select()
print "send message"
chan.basic_publish(msg, "alistair_test_ex", "alistair")
print "tx_commit"
chan.tx_commit()
#----------------------------------------------


It hangs on the chan.tx_commit().

tcpdump shows this (pruned summary, 172.22.2.53 is the client,
172.22.2.50 is the rabbitmq server):

23 0.047038 172.22.2.53 172.22.2.50 AMQP Tx.Select
24 0.047369 172.22.2.50 172.22.2.53 AMQP Tx.Select-Ok
25 0.047607 172.22.2.53 172.22.2.50 AMQP Basic.Publish
26 0.086293 172.22.2.50 172.22.2.53 TCP amqp > 34000 [ACK] Seq65
Ack85 Winh64 Len=0 TSV•1060977 TSER•1060626
27 0.086469 172.22.2.53 172.22.2.50 AMQP Content-Header Content-Body Tx.Commit
28 0.126293 172.22.2.50 172.22.2.53 TCP amqp > 34000 [ACK] Seq65
AckC2 Winh64 Len=0 TSV•1060987 TSER•1060636

You can see the client Tx.Commit (and it gets an ACK) but no
Tx.Commit-Ok comes back from the server. The client hangs here until I
ctrl-C it.

$ sudo rabbitmqctl list_queues -p /test name messages
Listing queues ...
alistair_test_q 0
...done.


A good session on a different broker shows this (again pruned,
172.22.3.2 is client, 172.22.3.3 is rabbitmq):

23 0.051783 172.22.3.2 172.22.3.3 AMQP Tx.Select
24 0.052205 172.22.3.3 172.22.3.2 AMQP Tx.Select-Ok
25 0.052517 172.22.3.2 172.22.3.3 AMQP Basic.Publish
26 0.090882 172.22.3.3 172.22.3.2 TCP amqp > 45215 [ACK] Seq65
Ack85 Winh64 Len=0 TSV$87670438 TSER!60225287
27 0.091095 172.22.3.2 172.22.3.3 AMQP Content-Header Content-Body Tx.Commit
28 0.091288 172.22.3.3 172.22.3.2 TCP amqp > 45215 [ACK] Seq65
AckC2 Winh64 Len=0 TSV$87670438 TSER!60225297
29 0.097365 172.22.3.3 172.22.3.2 AMQP Tx.Commit-Ok
30 0.097703 172.22.3.2 172.22.3.3 AMQP Tx.Select
31 0.097926 172.22.3.3 172.22.3.2 AMQP Tx.Select-Ok
32 0.098190 172.22.3.2 172.22.3.3 AMQP Basic.Get
33 0.098900 172.22.3.3 172.22.3.2 AMQP Basic.Get-Ok Content-Header Content-Body
34 0.099780 172.22.3.2 172.22.3.3 AMQP Tx.Commit
35 0.100436 172.22.3.3 172.22.3.2 AMQP Tx.Commit-Ok

$ sudo rabbitmqctl list_queues -p /test name messages
Listing queues ...
alistair_test_q 1
...done.


What else can I do to figure out what is wrong? Is there a way to get
more verbose logging without restarting the server? Are there some
other tools that'll help with debugging this?

Thanks,
Alistair

Search Discussions

  • Marek Majkowski at Feb 23, 2012 at 11:34 am

    On Thu, Feb 23, 2012 at 02:47, Alistair Bayley wrote:
    I'm looking for ways to debug a rabbitmq server that has been running
    since 2010-02-01. Sometime in the last 2 days it stopped delivering
    messages. I've seen this happen before, and restarting fixes it, but
    I'd like to get to the bottom of why it stops delivering messages, so
    I'd prefer to run tests while it is still in this state.

    We're using the python amqplib client. The various versions are:

    $ uname -srvmpio
    Linux 2.6.32-37-generic-pae #81-Ubuntu SMP Fri Dec 2 22:24:22 UTC 2011
    i686 unknown unknown GNU/Linux

    $ sudo rabbitmqctl report
    Reporting server status on {{2012,2,23},{2,18,37}}

    Status of node 'rabbit at rabbitmq1' ...
    [{pid,11649},
    ?{running_applications,[{rabbit,"RabbitMQ","2.7.1"},
    ? ? ? ? ? ? ? ? ? ? ? ?{os_mon,"CPO ?CXC 138 46","2.2.4"},
    ? ? ? ? ? ? ? ? ? ? ? ?{sasl,"SASL ?CXC 138 11","2.1.8"},
    ? ? ? ? ? ? ? ? ? ? ? ?{mnesia,"MNESIA ?CXC 138 12","4.4.12"},
    ? ? ? ? ? ? ? ? ? ? ? ?{stdlib,"ERTS ?CXC 138 10","1.16.4"},
    ? ? ? ? ? ? ? ? ? ? ? ?{kernel,"ERTS ?CXC 138 10","2.13.4"}]},
    ?{os,{unix,linux}},
    ?{erlang_version,"Erlang R13B03 (erts-5.7.4) [source] [rq:1]
    [async-threads:30] [hipe] [kernel-poll:true]\n"},
    ?{memory,[{total,107054896},
    ? ? ? ? ?{processes,72060236},
    ? ? ? ? ?{processes_used,72055268},
    ? ? ? ? ?{system,34994660},
    ? ? ? ? ?{atom,736077},
    ? ? ? ? ?{atom_used,727538},
    ? ? ? ? ?{binary,15125896},
    ? ? ? ? ?{code,6063325},
    ? ? ? ? ?{ets,10643172}]},
    ?{vm_memory_high_watermark,0.3999999992195367},
    ?{vm_memory_limit,102503219}]
    Hi,

    You seem to have quite a small amount of RAM - 100MiB, right?

    Do logs show high memory watermark being set?

    If so: rabbitmq tries not to go above a memory limit.
    When rabbit hits memory limit it stops consuming messages -
    all connections that produce messages will be stopped.
    The connection won't be touched until some memory is freed.

    But in setups where total memory is small, RabbitMQ may
    enter a state of just being above memory watermark and not
    being able to free anything. This may be happening to you.

    Cheers,
    Marek
  • Simon MacMullen at Feb 23, 2012 at 12:01 pm

    On 23/02/12 11:34, Marek Majkowski wrote:
    You seem to have quite a small amount of RAM - 100MiB, right?

    Do logs show high memory watermark being set?
    More to the point, the OP's "rabbitmqctl report" output shows memory use
    above the watermark. So yes...
    If so: rabbitmq tries not to go above a memory limit.
    When rabbit hits memory limit it stops consuming messages -
    As in, it will block on message publish. In tx mode that equates to a
    commit.
    all connections that produce messages will be stopped.
    The connection won't be touched until some memory is freed.

    But in setups where total memory is small, RabbitMQ may
    enter a state of just being above memory watermark and not
    being able to free anything. This may be happening to you.
    ...and you can set vm_memory_high_watermark higher to work around this.
    See http://www.rabbitmq.com/memory.html.

    Cheers, Simon

    --
    Simon MacMullen
    RabbitMQ, VMware
  • Alistair Bayley at Feb 23, 2012 at 4:58 pm

    On 24 February 2012 01:01, Simon MacMullen wrote:
    On 23/02/12 11:34, Marek Majkowski wrote:

    You seem to have quite a small amount of RAM - 100MiB, right?

    Do logs show high memory watermark being set?

    More to the point, the OP's "rabbitmqctl report" output shows memory use
    above the watermark. So yes...

    If so: rabbitmq tries not to go above a memory limit.
    When rabbit hits memory limit it stops consuming messages -

    As in, it will block on message publish. In tx mode that equates to a
    commit.

    all connections that produce messages will be stopped.
    The connection won't be touched until some memory is freed.

    But in setups where total memory is small, RabbitMQ may
    enter a state of just being above memory watermark and not
    being able to free anything. This may be happening to you.

    ...and you can set vm_memory_high_watermark higher to work around this. See
    http://www.rabbitmq.com/memory.html.
    Thanks.

    Why is it not able to free memory? A rabbitmqctl report on a freshly
    started server shows total memory about 10M, what causes it to grow
    and not shrink back? This server is currently doing literally nothing
    - all queues are empty - so I'd like to know why total memory doesn't
    drop back down to something close to the initial state.

    It is running on a deliberately small VM. As it is the only service on
    the VM (the VM exists solely to host rabbitmq) we can increase the
    vm_memory_high_watermark . I would try 80%, is there a reason to go
    higher (or not)?.

    Alistair
  • Marek Majkowski at Feb 23, 2012 at 5:52 pm

    On Thu, Feb 23, 2012 at 16:58, Alistair Bayley wrote:
    On 24 February 2012 01:01, Simon MacMullen wrote:
    On 23/02/12 11:34, Marek Majkowski wrote:

    You seem to have quite a small amount of RAM - 100MiB, right?

    Do logs show high memory watermark being set?

    More to the point, the OP's "rabbitmqctl report" output shows memory use
    above the watermark. So yes...

    If so: rabbitmq tries not to go above a memory limit.
    When rabbit hits memory limit it stops consuming messages -

    As in, it will block on message publish. In tx mode that equates to a
    commit.

    all connections that produce messages will be stopped.
    The connection won't be touched until some memory is freed.

    But in setups where total memory is small, RabbitMQ may
    enter a state of just being above memory watermark and not
    being able to free anything. This may be happening to you.

    ...and you can set vm_memory_high_watermark higher to work around this. See
    http://www.rabbitmq.com/memory.html.
    Thanks.

    Why is it not able to free memory? A rabbitmqctl report on a freshly
    started server shows total memory about 10M, what causes it to grow
    and not shrink back? This server is currently doing literally nothing
    - all queues are empty - so I'd like to know why total memory doesn't
    drop back down to something close to the initial state.

    It is running on a deliberately small VM. As it is the only service on
    the VM (the VM exists solely to host rabbitmq) we can increase the
    vm_memory_high_watermark . I would try 80%, is there a reason to go
    higher (or not)?.
    Unfortunately Erlang memory handling is quite complex. By default,
    Erlang tries to optimize what it can, and that may mean allocating more memory
    that you would want to.

    Basically - Erlang VM is quite heavy even if it is not doing much.

    I'm afraid you just need to:
    a) buy more RAM
    b) or, tweak the vm_memory_high_watermark . You can even set
    it to be more than 100% - if you don't mind getting erlang
    process into swap sometimes.
    But in such case, it's better to make sure that you won't have much
    data residing in queues. (It should be okay to have high throughput
    through rabbit, as far as consumers are keeping up)

    Marek
  • Marek Majkowski at Feb 23, 2012 at 5:55 pm

    Unfortunately Erlang memory handling is quite complex. By default,
    Erlang tries to optimize what it can, and that may mean allocating more memory
    that you would want to.

    Basically - Erlang VM is quite heavy even if it is not doing much.

    I'm afraid you just need to:
    ?a) buy more RAM
    ?b) or, tweak the vm_memory_high_watermark . You can even set
    ?it to be more than 100% - if you don't mind getting erlang
    ?process into swap sometimes.
    ?But in such case, it's better to make sure that you won't have much
    ?data residing in queues. (It should be okay to have high throughput
    ? through rabbit, as far as consumers are keeping up)
    You could also do:
    c) Tweak erlang gc: http://www.erlang.org/doc/man/erts_alloc.html
    but this is generally discouraged and hard enough to be a topic
    for a PhD thesis.

    Marek
  • Simon MacMullen at Feb 23, 2012 at 5:57 pm

    On 23/02/12 16:58, Alistair Bayley wrote:
    Why is it not able to free memory? A rabbitmqctl report on a freshly
    started server shows total memory about 10M, what causes it to grow
    and not shrink back? This server is currently doing literally nothing
    - all queues are empty - so I'd like to know why total memory doesn't
    drop back down to something close to the initial state.
    The report you posted showed most memory in the VM being used by Erlang
    processes. You could run something like:

    rabbitmqctl eval
    'lists:sublist(lists:reverse(lists:sort([{process_info(Pid,memory), Pid,
    process_info(Pid)} || Pid <- processes()])), 10).'

    to list the top 10 processes by memory use, along with some information
    for each one. This may tell us something interesting.

    Having said that, 100Mb for an idle server does not seem totally
    unreasonable - RabbitMQ is not optimised for small memory use when idle.
    It is running on a deliberately small VM. As it is the only service on
    the VM (the VM exists solely to host rabbitmq) we can increase the
    vm_memory_high_watermark . I would try 80%, is there a reason to go
    higher (or not)?.
    The reason we set 0.4 is that in a worst case scenario memory use can
    double during GC. That's relatively unlikely though and assuming you
    have some swapspace as well it will not be too bad if you do.

    Cheers, Simon

    --
    Simon MacMullen
    RabbitMQ, VMware
  • Alistair Bayley at Feb 23, 2012 at 8:00 pm

    On 24 February 2012 06:57, Simon MacMullen wrote:
    The report you posted showed most memory in the VM being used by Erlang
    processes. You could run something like:

    rabbitmqctl eval
    'lists:sublist(lists:reverse(lists:sort([{process_info(Pid,memory), Pid,
    process_info(Pid)} || Pid <- processes()])), 10).'

    to list the top 10 processes by memory use, along with some information for
    each one. This may tell us something interesting.
    Just ran this:

    $ sudo rabbitmqctl eval
    'lists:sublist(lists:reverse(lists:sort([{process_info(Pid,memory),
    Pid, process_info(Pid)} || Pid <- processes()])), 10).'

    (wait for a bit - about 30s or so...)

    Crash dump was written to: erl_crash.dump
    eheap_alloc: Cannot allocate 153052320 bytes of memory (of type "old_heap").
    Aborted
    $

    hmmm....

    Having said that, 100Mb for an idle server does not seem totally
    unreasonable - RabbitMQ is not optimised for small memory use when idle.
    OK. Can I have some guidance as to how much memory we should allocate?
    The ubuntu VMs we use have 256M by default; this is usually more than
    enough for our needs. Is that enough for rabbitmq? Let's assume that
    we have a small number of queues (6 or so) and the queues will contain
    a small number (in the 100s) of fairly small messages (a few hundred
    bytes each). The consumers are aggressive; messages should not remain
    on queues for long, although they are persistent.

    I don't see much in the docs about memory requirements. I think I'll
    start by setting the vm_memory_high_watermark to 0, and seeing where
    the memory use plateaus (assuming it does).

    Also, I see from the tcpdump capture that the server does not send
    back a response to the Tx.Commit, which presumably makes it difficult
    for the client to detect that the message has not been accepted (other
    than timeout). Indeed, the basic_publish returns nothing immediately,
    although that could be a feature of the amqplib library. Is there an
    option to get the server to return something that says that it is not
    accepting messages?

    Thanks,
    Alistair
  • Alistair Bayley at Feb 23, 2012 at 9:19 pm

    On 24 February 2012 09:00, Alistair Bayley wrote:
    I don't see much in the docs about memory requirements. I think I'll
    start by setting the vm_memory_high_watermark to 0, and seeing where
    the memory use plateaus (assuming it does).
    OK. I've just set the vm_memory_high_watermark to 0 and now the broker
    won't accept any messages. In the log:

    =INFO REPORT==== 24-Feb-2012::09:18:29 ===
    Limiting to approx 924 file handles (829 sockets)

    =INFO REPORT==== 24-Feb-2012::09:18:30 ===
    Memory limit set to 0MB of 244MB total.

    =INFO REPORT==== 24-Feb-2012::09:18:30 ===
    vm_memory_high_watermark set. Memory used:12963672 allowed:0

    =INFO REPORT==== 24-Feb-2012::09:18:30 ===
    alarm_handler: {set,
    {{vm_memory_high_watermark,
    'rabbit at kollector-staging-rabbitmq1'},
    []}}

    In the docs (http://www.rabbitmq.com/memory.html) it says: "A value of
    0 disables the memory monitor and throttling of producers."

    I took this to mean that there is no throttling i.e. messages would
    always be accepted. Have I misunderstood?

    Alistair
  • Simon MacMullen at Feb 24, 2012 at 11:21 am

    On 23/02/12 21:19, Alistair Bayley wrote:
    OK. I've just set the vm_memory_high_watermark to 0 and now the broker
    won't accept any messages. In the log:
    Damn, that documentation was wrong. I've fixed it. Try setting it to 100...

    Cheers, Simon

    --
    Simon MacMullen
    RabbitMQ, VMware
  • Simon MacMullen at Feb 24, 2012 at 11:38 am

    On 23/02/12 20:00, Alistair Bayley wrote:
    Just ran this:

    $ sudo rabbitmqctl eval
    'lists:sublist(lists:reverse(lists:sort([{process_info(Pid,memory),
    Pid, process_info(Pid)} || Pid<- processes()])), 10).'

    (wait for a bit - about 30s or so...)

    Crash dump was written to: erl_crash.dump
    eheap_alloc: Cannot allocate 153052320 bytes of memory (of type "old_heap").
    Aborted
    $

    hmmm....
    Hmm indeed. If it ran out of memory doing that, then that implies you
    might have a ton of processes hanging around for some reason. Either
    that or you were very close to the edge anyway.

    rabbitmqctl eval 'length(processes()).' will tell you how many processes
    are in use in total.

    So you may want to try the second rabbitmqctl eval command above, and
    then the first one with more memory available to the server.
    Having said that, 100Mb for an idle server does not seem totally
    unreasonable - RabbitMQ is not optimised for small memory use when idle.
    OK. Can I have some guidance as to how much memory we should allocate?
    This is a real how-long-is-a-piece-of-string question I'm afraid. 512M
    would already be a significant step towards the comfort zone though.
    Also, I see from the tcpdump capture that the server does not send
    back a response to the Tx.Commit, which presumably makes it difficult
    for the client to detect that the message has not been accepted (other
    than timeout). Indeed, the basic_publish returns nothing immediately,
    although that could be a feature of the amqplib library. Is there an
    option to get the server to return something that says that it is not
    accepting messages?
    In ancient history we used to do this; there's a method in AMQP 0-9-1
    called channel.flow which the server should be able to use to tell a
    client to shut up. In practice it was unreliable (client may not support
    it, or may take too long to take notice of it).

    We could just send channel.flows informationally but again this would
    require client support.

    I am currently wondering about modifying our clients so they have a
    non-blocking publish (see the thread "Ensuring low latency for
    publishers"). This should not be too hard to do. However, this might not
    help when using transactions. OTOH blocking in such a way that we don't
    send back a tx.commit is unhelpful; I'll see how easy it would be to
    change this.

    Cheers, Simon

    --
    Simon MacMullen
    RabbitMQ, VMware
  • Alistair Bayley at Feb 27, 2012 at 3:37 am

    On 25 February 2012 00:38, Simon MacMullen wrote:

    OK. Can I have some guidance as to how much memory we should allocate?
    This is a real how-long-is-a-piece-of-string question I'm afraid. 512M would
    already be a significant step towards the comfort zone though.
    rabbitmqctl report shows a large number of channels (in the thousands)
    and memory usage seems to correlate with the large number of channels.
    I've left the server running and shut down each of the clients, and
    the channels have vanished from the report output, so I suspect a
    dodgy client here. We're using python kombu over ampqlib, so maybe
    kombu is doing something undesirable. A quick glance at the source
    shows that it has a channel pool, so maybe that's to blame. Anyway,
    more investigation required...

    Once the number of channels is down to less than 10, the memory usage
    is around 32M, which seems fine to me. Maybe I don't need to allocate
    more, if I can solve this channels problem.

    Is there an
    option to get the server to return something that says that it is not
    accepting messages?
    In ancient history we used to do this; there's a method in AMQP 0-9-1 called
    channel.flow which the server should be able to use to tell a client to shut
    up. In practice it was unreliable (client may not support it, or may take
    too long to take notice of it).
    I will do some more testing. Maybe the immediate flag on basic_publish
    will do something useful. python amqplib is protocol version 0.8, so
    whatever I do must fit in with that.

    Alistair
  • Marek Majkowski at Feb 27, 2012 at 11:49 am

    On Mon, Feb 27, 2012 at 03:37, Alistair Bayley wrote:
    On 25 February 2012 00:38, Simon MacMullen wrote:

    OK. Can I have some guidance as to how much memory we should allocate?
    This is a real how-long-is-a-piece-of-string question I'm afraid. 512M would
    already be a significant step towards the comfort zone though.
    rabbitmqctl report shows a large number of channels (in the thousands)
    and memory usage seems to correlate with the large number of channels.
    I've left the server running and shut down each of the clients, and
    the channels have vanished from the report output, so I suspect a
    dodgy client here. We're using python kombu over ampqlib, so maybe
    kombu is doing something undesirable. A quick glance at the source
    shows that it has a channel pool, so maybe that's to blame. Anyway,
    more investigation required...
    Right. This is definitely possible that channels are the main
    thing that consumes memory.
    Once the number of channels is down to less than 10, the memory usage
    is around 32M, which seems fine to me. Maybe I don't need to allocate
    more, if I can solve this channels problem.
    Yup, quite possible.
    Is there an
    option to get the server to return something that says that it is not
    accepting messages?
    In ancient history we used to do this; there's a method in AMQP 0-9-1 called
    channel.flow which the server should be able to use to tell a client to shut
    up. In practice it was unreliable (client may not support it, or may take
    too long to take notice of it).
    I will do some more testing. Maybe the immediate flag on basic_publish
    will do something useful. python amqplib is protocol version 0.8, so
    whatever I do must fit in with that.
    Nope, "immediate" flag is a completely unrelated:
    http://www.rabbitmq.com/amqp-0-9-1-reference.html#basic.publish

    Marek

Related Discussions

Discussion Navigation
viewthread | post
Discussion Overview
grouprabbitmq-discuss @
categoriesrabbitmq
postedFeb 23, '12 at 2:47a
activeFeb 27, '12 at 11:49a
posts13
users3
websiterabbitmq.com
irc#rabbitmq

People

Translate

site design / logo © 2022 Grokbase