Hi,

I am currently running a stress test on one of our RabbitMQ production
servers. The goal of the test is to discover how many queues RabbitMQ
can support; this will tell us how many tenants we can scale to on a
single box. The test creates a large number of queues, then for each
queue adds three tasks to a thread pool (with 20 threads). One task
writes 100 messages, the next reads 100 messages, and the third task
reads and writes 100 messages. The messages are read with basicGet. We
are currently using the RabbitMQ 2.5.1 server and Java client on 64-bit
CentOS with 8 cores and 96 GB of RAM (the memory high-water mark is set
to 0.8).

As we ramp the number of queues above 32,000, some of the tasks start
to fail in basicGet with an IOException, and a matching error appears
in the server log file. These errors become more frequent as the number
of queues increases.

Queues: 32000 task success: 95997 errors: 3
Queues: 64000 task success: 159997 errors: 32003
Queues: 128000 task success: 287997 errors: 96003

The errors occur in batches and then clear up; however, the error
counts are very suspicious, as each is (n - 32000 + 3). During the
whole test only 6-8 GB of RAM is used, and the rabbit server uses at
most 350% CPU (of 800%). I am wondering whether this is a sign that I
am hitting a limit on the number of queues, that I have misconfigured
RabbitMQ, or that my test is not valid. Any help in figuring this out
would be appreciated.
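
A quick way to sanity-check the suspicious counts is to test them against the (n - 32000 + 3) pattern directly. The snippet below is just a sketch of that arithmetic, using the three data points from the results above:

```python
# Observed results from the stress test: (queues created, task errors)
observed = [(32000, 3), (64000, 32003), (128000, 96003)]

for queues, errors in observed:
    # Hypothesis: roughly one failure per queue beyond the 32,000th,
    # plus the constant 3 errors already seen at 32,000 queues.
    predicted = max(queues - 32000, 0) + 3
    print(queues, errors, predicted, errors == predicted)
```

All three rows match, which points to a hard limit around 32,000 queues rather than random load-related failures.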

Regards,
Iain.


=ERROR REPORT==== 13-Sep-2011::09:09:41 ===
connection <0.9484.159>, channel 1 - error:
{amqp_error,not_found,"no queue 'TestExchange_31997' in vhost '/'",
            'basic.get'}

=ERROR REPORT==== 13-Sep-2011::09:09:41 ===
** Generic server <0.9652.183> terminating
** Last message in was {deliver,
    {delivery,true,false,none,<0.9651.183>,
        {basic_message,
            {resource,<<"/">>,exchange,<<"TestExchange_31998">>},
            [<<>>],
            {content,60,
                {'P_basic',<<"text/plain">>,undefined,undefined,2,
                    0,undefined,undefined,undefined,undefined,
                    undefined,undefined,undefined,undefined,undefined},
                <<152,0,10,116,101,120,116,47,112,108,97,105,110,2,0>>,
                rabbit_framing_amqp_0_9_1,
                [<<"Hello World!0 from q=hello">>]},
            <<18,156,101,170,65,11,23,81,252,168,116,117,202,153,120,220>>,
            true},
        undefined}}
** When Server state == {q,
    {amqqueue,
        {resource,<<"/">>,queue,<<"TestExchange_31998">>},
        true,false,none,[],<0.9652.183>},
    none,false,rabbit_variable_queue,
    {vqstate,
        {[],[]},
        {0,{[],[]}},
        {delta,undefined,0,undefined},
        {0,{[],[]}},
        {[],[]},
        0,
        {dict,0,16,16,8,80,48,
            {[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[]},
            {{[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[]}}},
        undefined,
        {0,nil},
        {qistate,
            "/var/lib/rabbitmq/mnesia/rabbit@dev182/queues/3N66B3H9OM1Y8OASL0S3XK8TT",
            {{dict,0,16,16,8,80,48,
                {[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[]},
                {{[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[]}}},
             []},
            undefined,0,262144,
            #Fun<rabbit_variable_queue.2.111108431>,[]},
        {{client_msstate,msg_store_persistent,
            <<81,127,46,195,11,139,99,60,19,29,15,68,36,202,246,12>>,
            {dict,0,16,16,8,80,48,
                {[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[]},
                {{[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[]}}},
            {state,245835,
                "/var/lib/rabbitmq/mnesia/rabbit@dev182/msg_store_persistent"},
            rabbit_msg_store_ets_index,
            "/var/lib/rabbitmq/mnesia/rabbit@dev182/msg_store_persistent",
            <0.240.0>,249932,241738,254029},
         {client_msstate,msg_store_transient,
            <<201,227,238,162,120,244,196,219,223,246,129,3,224,211,0,204>>,
            {dict,0,16,16,8,80,48,
                {[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[]},
                {{[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[]}}},
            {state,229447,
                "/var/lib/rabbitmq/mnesia/rabbit@dev182/msg_store_transient"},
            rabbit_msg_store_ets_index,
            "/var/lib/rabbitmq/mnesia/rabbit@dev182/msg_store_transient",
            <0.232.0>,233544,225350,237641}},
        {sync,[],[],[],[]},
        true,0,#Fun<rabbit_amqqueue_process.4.73542412>,
        #Fun<rabbit_amqqueue_process.5.110277333>,0,0,
        infinity,0,0,0,0,0,0,
        {rates,
            {{1315,904981,880372},0},
            {{1315,904981,880372},0},
            0.0,0.0,
            {1315,904981,880372}},
        {0,nil},
        {0,nil},
        {0,nil},
        {0,nil},
        0,0,
        {rates,
            {{1315,904981,880372},0},
            {{1315,904981,880372},0},
            0.0,0.0,
            {1315,904981,880372}}},
    {[],[]},
    {[],[]},
    undefined,undefined,
    {1315904986880394,#Ref<0.0.298.230323>},
    undefined,
    {state,fine,{1315904986880410,#Ref<0.0.298.230325>}},
    {dict,0,16,16,8,80,48,
        {[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[]},
        {{[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[]}}},
    undefined,undefined}
** Reason for termination ==
** {{badmatch,{error,emlink}},
    [{rabbit_queue_index,get_journal_handle,1},
     {rabbit_queue_index,publish,5},
     {rabbit_variable_queue,maybe_write_index_to_disk,3},
     {rabbit_variable_queue,maybe_write_to_disk,4},
     {rabbit_variable_queue,publish,5},
     {rabbit_variable_queue,publish,4},
     {rabbit_amqqueue_process,deliver_or_enqueue,2},
     {rabbit_amqqueue_process,handle_call,3}]}





Tue Sep 13 09:09:41 UTC 2011: java.io.IOException
    at com.rabbitmq.client.impl.AMQChannel.wrap(AMQChannel.java:110)
    at com.rabbitmq.client.impl.AMQChannel.exnWrappingRpc(AMQChannel.java:134)
    at com.rabbitmq.client.impl.ChannelN.basicGet(ChannelN.java:806)
    at com.workday.messaging.test.BreakRabbitBreakTest.receiveMessages(BreakRabbitBreakTest.java:200)
    at com.workday.messaging.test.BreakRabbitBreakTest.access$2(BreakRabbitBreakTest.java:198)
    at com.workday.messaging.test.BreakRabbitBreakTest$SendAndReceive.perform(BreakRabbitBreakTest.java:262)
    at com.workday.messaging.test.BreakRabbitBreakTest$AbstractMessageRunner.call(BreakRabbitBreakTest.java:234)
    at com.workday.messaging.test.BreakRabbitBreakTest$AbstractMessageRunner.call(BreakRabbitBreakTest.java:1)
    at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
    at java.util.concurrent.FutureTask.run(FutureTask.java:138)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:441)
    at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
    at java.util.concurrent.FutureTask.run(FutureTask.java:138)
    at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
    at java.lang.Thread.run(Thread.java:662)
Caused by: com.rabbitmq.client.ShutdownSignalException: channel error;
reason: {#method<channel.close>(reply-code=404, reply-text=NOT_FOUND -
no queue 'TestExchange_31997' in vhost '/', class-id=60,
method-id=70),null,""}
    at com.rabbitmq.utility.ValueOrException.getValue(ValueOrException.java:67)
    at com.rabbitmq.utility.BlockingValueOrException.uninterruptibleGetValue(BlockingValueOrException.java:33)
    at com.rabbitmq.client.impl.AMQChannel$BlockingRpcContinuation.getReply(AMQChannel.java:337)
    at com.rabbitmq.client.impl.AMQChannel.privateRpc(AMQChannel.java:210)
    at com.rabbitmq.client.impl.AMQChannel.exnWrappingRpc(AMQChannel.java:128)
    ... 14 more
Caused by: com.rabbitmq.client.ShutdownSignalException: channel error;
reason: {#method<channel.close>(reply-code=404, reply-text=NOT_FOUND -
no queue 'TestExchange_31997' in vhost '/', class-id=60,
method-id=70),null,""}
    at com.rabbitmq.client.impl.ChannelN.asyncShutdown(ChannelN.java:422)
    at com.rabbitmq.client.impl.ChannelN.processAsync(ChannelN.java:262)
    at com.rabbitmq.client.impl.AMQChannel.handleCompleteInboundCommand(AMQChannel.java:154)
    at com.rabbitmq.client.impl.AMQChannel.handleFrame(AMQChannel.java:99)
    at com.rabbitmq.client.impl.AMQConnection$MainLoop.run(AMQConnection.java:443)
Tue Sep 13 09:09:41 UTC 2011: java.io.IOException
    at com.rabbitmq.client.impl.AMQChannel.wrap(AMQChannel.java:110)
    at com.rabbitmq.client.impl.AMQChannel.exnWrappingRpc(AMQChannel.java:134)
    at com.rabbitmq.client.impl.ChannelN.basicGet(ChannelN.java:806)
    at com.workday.messaging.test.BreakRabbitBreakTest.receiveMessages(BreakRabbitBreakTest.java:200)
    at com.workday.messaging.test.BreakRabbitBreakTest.access$2(BreakRabbitBreakTest.java:198)
    at com.workday.messaging.test.BreakRabbitBreakTest$SendAndReceive.perform(BreakRabbitBreakTest.java:262)
    at com.workday.messaging.test.BreakRabbitBreakTest$AbstractMessageRunner.call(BreakRabbitBreakTest.java:234)
    at com.workday.messaging.test.BreakRabbitBreakTest$AbstractMessageRunner.call(BreakRabbitBreakTest.java:1)
    at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
    at java.util.concurrent.FutureTask.run(FutureTask.java:138)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:441)
    at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
    at java.util.concurrent.FutureTask.run(FutureTask.java:138)
    at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
    at java.lang.Thread.run(Thread.java:662)
Caused by: com.rabbitmq.client.ShutdownSignalException: channel error;
reason: {#method<channel.close>(reply-code=404, reply-text=NOT_FOUND -
no queue 'TestExchange_31998' in vhost '/', class-id=60,
method-id=70),null,""}
    at com.rabbitmq.utility.ValueOrException.getValue(ValueOrException.java:67)
    at com.rabbitmq.utility.BlockingValueOrException.uninterruptibleGetValue(BlockingValueOrException.java:33)
    at com.rabbitmq.client.impl.AMQChannel$BlockingRpcContinuation.getReply(AMQChannel.java:337)
    at com.rabbitmq.client.impl.AMQChannel.privateRpc(AMQChannel.java:210)
    at com.rabbitmq.client.impl.AMQChannel.exnWrappingRpc(AMQChannel.java:128)
    ... 14 more
Caused by: com.rabbitmq.client.ShutdownSignalException: channel error;
reason: {#method<channel.close>(reply-code=404, reply-text=NOT_FOUND -
no queue 'TestExchange_31998' in vhost '/', class-id=60,
method-id=70),null,""}
    at com.rabbitmq.client.impl.ChannelN.asyncShutdown(ChannelN.java:422)
    at com.rabbitmq.client.impl.ChannelN.processAsync(ChannelN.java:262)
    at com.rabbitmq.client.impl.AMQChannel.handleCompleteInboundCommand(AMQChannel.java:154)
    at com.rabbitmq.client.impl.AMQChannel.handleFrame(AMQChannel.java:99)
    at com.rabbitmq.client.impl.AMQConnection$MainLoop.run(AMQConnection.java:443)


  • Matthew Sackman at Sep 13, 2011 at 3:47 pm
    Hi,
    ** Reason for termination ==
    ** {{badmatch,{error,emlink}},
    EMLINK: Too many links (POSIX.1)

    Rabbit doesn't create links though, so this is a rather odd error to
    occur, but for some reason, your kernel/filesystem is reporting this
    error. Do you have lots of links on your disks?

    Matthew
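
[For readers unfamiliar with the errno: EMLINK is what a POSIX filesystem returns when an operation would push a link count past its maximum. On ext3 it also fires when creating roughly the 32,000th subdirectory of one directory, because every subdirectory holds a ".." link back to its parent. A minimal illustration of the errno's meaning, not part of the original thread:]

```python
import errno
import os

# EMLINK is the errno behind {error,emlink} in the Erlang crash report above.
print(errno.EMLINK, os.strerror(errno.EMLINK))  # e.g. "31 Too many links"
```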
  • Iain Hull at Sep 14, 2011 at 7:47 am
    Hi Matthew,

    Thank you very much. I think I have tracked the problem down to our
    use of the ext3 file system for /var/lib/rabbitmq.
  • Matthew Sackman at Sep 14, 2011 at 10:02 am
    Hi Iain,
    On Wed, Sep 14, 2011 at 02:47:33AM -0500, Iain Hull wrote:
    From http://wlug.org.nz/EMLINK
    So yes, this does mean that you are limited to 32000 subdirectories
    in one directory in ext3
    As a result the mnesia database is not able to support more than 32000
    queues on an ext3 filesystem. I will discuss the options with our ops
    guys and consider upgrading the file system to ext4 or similar and rerun
    my tests.
    Wow - I'd not considered that file systems these days would still have
    such limits... In my own testing a little while ago, I found that JFS
    had the best performance for my hardware when running Rabbit, but I
    didn't do any huge scaling tests so I wouldn't be surprised if that has
    even lower limits for such things.

    Certainly my testing of ext4 is not favourable. It seems to have very
    inconsistent performance. That said, if it works for you...

    Matthew
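
[Since the ext3 ceiling shows up as EMLINK on mkdir, the per-directory limit can be probed empirically before committing to a filesystem. The script below is illustrative, not from the thread; point it at a scratch directory on the filesystem under test and raise the cap above 32,000 to find the real limit on ext3:]

```python
import errno
import os
import tempfile

def probe_subdir_limit(parent, cap):
    """Create subdirectories under `parent` until mkdir fails with
    EMLINK or `cap` is reached; return the number created."""
    created = 0
    for i in range(cap):
        try:
            os.mkdir(os.path.join(parent, "q%06d" % i))
            created += 1
        except OSError as e:
            if e.errno == errno.EMLINK:
                break  # hit the filesystem's per-directory link limit
            raise
    return created

if __name__ == "__main__":
    scratch = tempfile.mkdtemp()
    # Small cap here as a smoke test; raise it (e.g. to 40000) to
    # actually locate the limit on ext3.
    print(probe_subdir_limit(scratch, 100))
```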
  • Iain Hull at Sep 14, 2011 at 11:47 am
    Hi Matthew,
    On Thu, Sep 14, 2011 at 11:03, Matthew Sackman wrote:
    Wow - I'd not considered that file systems these days would still have
    such limits... In my own testing a little while ago, I found that JFS
    had the best performance for my hardware when running Rabbit, but I
    didn't do any huge scaling tests so I wouldn't be surprised if that has
    even lower limits for such things.
    Certainly my testing of ext4 is not favourable. It seems to have very
    inconsistent performance. That said, if it works for you...
    Ok thanks I will add xfs and jfs to my mix for testing. Do you still
    have those results? And are they in a consumable format? Our ops team
    would be really interested in looking over them as the basis for our own
    testing.

    Regards,
    Iain.
  • Matthew Sackman at Sep 14, 2011 at 12:12 pm
    Hi Iain,
    On Wed, Sep 14, 2011 at 06:47:10AM -0500, Iain Hull wrote:
    Certainly my testing of ext4 is not favourable. It seems to have very
    inconsistent performance. That said, if it works for you...
    Ok thanks I will add xfs and jfs to my mix for testing. Do you still
    have those results? And are they in a consumable format? Our ops team
    would be really interested in looking over them as the basis for our own
    testing.
    I don't still have them, but I'm sure they'd have been simple enough -
    probably just measuring overall performance with something like
    MulticastMain (Java client) with various flags or I may have written
    some custom tests with the Erlang client - certainly you could just use
    MulticastMain with -f persistent -r X and tune X for each file system
    until you find the highest point at which the latency doesn't grow,
    although you'll need to leave it running for a few mins before you can
    be sure it's stable.

    That'll just measure the cost of writing msgs to disk, though only some
    bits of that writing are actually in the path of the msg - other bits
    may be done asynchronously at leisure, though ultimately, all bits are
    competing for disk bandwidth, so measuring the rate at which data is
    written to disk may be indicative of performance.

    You could also try creating a durable queue, filling it with a few
    million persistent msgs, cleanly shutting down and restarting rabbit,
    and then timing how long it takes to drain the queue from a client.

    Ideally, you should write your own tests which are as close a match to
    the access patterns you expect to see in your applications as possible.
    Only performance under those conditions is actually of any use to you.
    Especially when you start using features such as publisher confirms.

    Matthew
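
[Matthew's suggestion of tuning MulticastMain's -r flag amounts to a simple search: ramp the publish rate and keep the highest rate at which latency stays flat. A rough sketch of that search loop, with the latency measurement stubbed out; in a real run the measurement would come from MulticastMain's reported latency, and the 50 ms threshold is an arbitrary illustrative choice:]

```python
def highest_stable_rate(measure_latency_ms, rates, threshold_ms=50.0):
    """Return the highest rate (msgs/sec) whose measured latency stays
    under `threshold_ms`; rates are tried in increasing order."""
    best = None
    for rate in rates:
        if measure_latency_ms(rate) < threshold_ms:
            best = rate
        else:
            break  # latency has started to grow; stop ramping
    return best

# Stub: pretend latency explodes past 5000 msgs/sec.
fake_latency = lambda rate: 10.0 if rate <= 5000 else 900.0
print(highest_stable_rate(fake_latency, [1000, 2000, 5000, 10000]))
```

Remember Matthew's caveat: hold each rate for a few minutes before trusting the latency reading, or a rate that only looks stable briefly will be accepted.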
  • David Wragg at Sep 14, 2011 at 1:43 pm

    "Iain Hull" <iain.hull at workday.com> writes:
    Ok thanks I will add xfs and jfs to my mix for testing. Do you still
    have those results? And are they in a consumable format? Our ops team
    would be really interested in looking over them as the basis for our own
    testing.
    If you benchmark different Linux filesystems, be very careful to make
    sure that you are doing apples-to-apples comparisons. Different
    filesystems have different data integrity guarantees. In particular,
    ext3 and ext4 get different numbers mainly because they have different
    defaults for the 'barrier' option. So you need to explicitly set this
    option in order to perform a fair comparison. Last time I did this,
    their performance seemed roughly equivalent.

    --
    David Wragg
    Staff Engineer, RabbitMQ
    VMware, Inc.
  • Iain Hull at Sep 14, 2011 at 3:22 pm
    Thanks Matthew,

    I will ensure that we do this (ops/devops are in the US so haven't had a
    chance to talk to them yet). My primary concern is how they handle
    large numbers of subdirectories, as opposed to raw performance.

    Regards,
    Iain.

    -----Original Message-----
    From: David Wragg [mailto:david at rabbitmq.com]
    Sent: 14 September 2011 14:43
    To: Iain Hull
    Cc: Matthew Sackman; rabbitmq-discuss at lists.rabbitmq.com
    Subject: Re: [rabbitmq-discuss] Channels closed unexpectedly in Java
    client

    "Iain Hull" <iain.hull at workday.com> writes:
    Ok thanks I will add xfs and jfs to my mix for testing. Do you still
    have those results? And are they in a consumable format? Our ops team
    would be really interested in looking over them as the basis for our own
    testing.
    If you benchmark different Linux filesystems, be very careful to make
    sure that you are doing apples-to-apples comparisons. Different
    filesystems have different data integrity guarantees. In particular,
    ext3 and ext4 get different numbers mainly because they have different
    defaults for the 'barrier' option. So you need to explicitly set this
    option in order to perform a fair comparison. Last time I did this,
    their performance seemed roughly equivalent.

    --
    David Wragg
    Staff Engineer, RabbitMQ
    VMware, Inc.
  • Matthew Sackman at Sep 14, 2011 at 1:39 pm

    On Wed, Sep 14, 2011 at 11:02:49AM +0100, Matthew Sackman wrote:
    Wow - I'd not considered that file systems these days would still have
    such limits... In my own testing a little while ago, I found that JFS
    had the best performance for my hardware when running Rabbit, but I
    didn't do any huge scaling tests so I wouldn't be surprised if that has
    even lower limits for such things.
    ...except it doesn't. I've just written a test to, erm, test this:

    -module(test).
    -compile([export_all]).

    -include_lib("amqp_client/include/amqp_client.hrl").

    loadsa_queues() ->
        {ok, Conn} = amqp_connection:start(#amqp_params_network{}),
        {ok, Chan} = amqp_connection:open_channel(Conn),
        Method = #'basic.publish'{},
        Content = #amqp_msg{ props = #'P_basic'{ delivery_mode = 2 },
                             payload = <<0:8>> },
        [begin
             #'queue.declare_ok'{ queue = Q } =
                 amqp_channel:call(Chan, #'queue.declare'{ durable = true,
                                                           exclusive = true }),
             amqp_channel:cast(Chan, Method#'basic.publish'{ routing_key = Q },
                               Content)
         end || _ <- lists:seq(1, 100000)],
        amqp_connection:close(Conn).

    The reason for doing the publish is that the queue creates its directory
    lazily, so if you don't do the publish, the directory will never be
    created, so you won't stress the filesystem.

    So I ran that, at the same time as some
    rabbitmqctl list_queues | wc -l
    and
    watch "ls /home/matthew/ssd/rabbitmq-rabbit-mnesia/queues/|wc -l"
    and sure enough, it happily got up to the 100,000 queues I'd asked for.


    Matthew
  • Matthew Sackman at Sep 14, 2011 at 1:42 pm

    On Wed, Sep 14, 2011 at 02:39:05PM +0100, Matthew Sackman wrote:
    -module(test).
    -compile([export_all]).

    -include_lib("amqp_client/include/amqp_client.hrl").

    loadsa_queues() ->
        {ok, Conn} = amqp_connection:start(#amqp_params_network{}),
        {ok, Chan} = amqp_connection:open_channel(Conn),
        Method = #'basic.publish'{},
        Content = #amqp_msg{ props = #'P_basic'{ delivery_mode = 2 },
                             payload = <<0:8>> },
        [begin
             #'queue.declare_ok'{ queue = Q } =
                 amqp_channel:call(Chan, #'queue.declare'{ durable = true,
                                                           exclusive = true }),
             amqp_channel:cast(Chan, Method#'basic.publish'{ routing_key = Q },
                               Content)
         end || _ <- lists:seq(1, 100000)],
        amqp_connection:close(Conn).
    Oh yes, seeing as I've just run into this myself, there's currently a
    performance issue with deleting an awful lot of queues at the same time
    - reported at
    http://old.nabble.com/Problem-auto-deleting-large-number-of-queues--td32346666.html
    - we are working on a fix for this right now, but it wasn't ready in
    time for 2.6.1. The bug is actually in Erlang/OTP itself. So if you find
    your tests taking a huge amount of time deleting a lot of queues, it's
    that bug, and erm, thus "nothing to worry about" ;)

    Matthew
  • Iain Hull at Sep 14, 2011 at 3:28 pm
    Thanks for the heads-up on the auto-delete; I will test that once
    I get the queue size up. Currently my queues expire after 24 hours
    instead of auto-deleting after a connection is dropped.

    Regards,
    Iain.

    -----Original Message-----
    From: rabbitmq-discuss-bounces at lists.rabbitmq.com
    [mailto:rabbitmq-discuss-bounces at lists.rabbitmq.com] On Behalf Of
    Matthew Sackman
    Sent: 14 September 2011 14:42
    To: rabbitmq-discuss at lists.rabbitmq.com
    Subject: Re: [rabbitmq-discuss] Channels closed unexpectedly in Java
    client
    On Wed, Sep 14, 2011 at 02:39:05PM +0100, Matthew Sackman wrote:
    -module(test).
    -compile([export_all]).

    -include_lib("amqp_client/include/amqp_client.hrl").

    loadsa_queues() ->
        {ok, Conn} = amqp_connection:start(#amqp_params_network{}),
        {ok, Chan} = amqp_connection:open_channel(Conn),
        Method = #'basic.publish'{},
        Content = #amqp_msg{ props = #'P_basic'{ delivery_mode = 2 },
                             payload = <<0:8>> },
        [begin
             #'queue.declare_ok'{ queue = Q } =
                 amqp_channel:call(Chan, #'queue.declare'{ durable = true,
                                                           exclusive = true }),
             amqp_channel:cast(Chan, Method#'basic.publish'{ routing_key = Q },
                               Content)
         end || _ <- lists:seq(1, 100000)],
        amqp_connection:close(Conn).
    Oh yes, seeing as I've just run into this myself, there's currently a
    performance issue with deleting an awful lot of queues at the same time
    - reported at

    http://old.nabble.com/Problem-auto-deleting-large-number-of-queues--td32346666.html
    - we are working on a fix for this right now, but it wasn't ready in
    time for 2.6.1. The bug is actually in Erlang/OTP itself. So if you find
    your tests taking a huge amount of time deleting a lot of queues, it's
    that bug, and erm, thus "nothing to worry about" ;)

    Matthew
    _______________________________________________
    rabbitmq-discuss mailing list
    rabbitmq-discuss at lists.rabbitmq.com
    https://lists.rabbitmq.com/cgi-bin/mailman/listinfo/rabbitmq-discuss

Discussion Overview
group: rabbitmq-discuss
categories: rabbitmq
posted: Sep 13, '11 at 3:43p
active: Sep 14, '11 at 3:28p
posts: 11
users: 3
website: rabbitmq.com
irc: #rabbitmq
