Here are my thoughts (inline below). I'm not the greatest clustering/HA expert, but from what I can gather, I think what you're seeing might be a result of the combination of a network partition and the 'autoheal' mode.
On 26 Jun 2013, at 06:24, thomas wrote:
Thanks for the reply. I am using rabbitmq 3.1.1 with autoheal for
My test is just to find out the behavior of rabbitmq master should its slave
This is clearly documented on the website - RabbitMQ clusters /do not/ handle network partitions well. Things will probably go wrong. If the only issue you're running into is a busy producer getting blocked then that's great - the worst cases could have involved lost messages and all sorts.
Now, the intention with HA is *not* to go belly up in the face of disappearing slaves - quite the opposite. Quite why the master is suddenly blocking the producer I'm not sure - I'll discuss this with the rest of the team but it doesn't sound quite right to me.
Approximately 10 seconds after I shut down the network connection of
rabbit@B, the sending of messages to rabbit@A comes to a pause for close to
a minute and then continues as per normal.
Just to be clear, in a RabbitMQ cluster there is no concept of a master node - only mirrored queues introduce the master/slave concept, and then only for queues that are mirrored. Each mirrored queue's master can end up on a different node, depending on which node you were connected to when it was declared.
Trying the same test but with no mirroring does not have any disruption.
That is not surprising, since the queue you're connected to isn't trying to replicate to any other nodes.
By the way, I have tried out the same test for rabbitmq mirroring except that
there will be a delay of 10ms between sending every message and there was no
Can you share your test code? I'm surprised there is such a big difference, but still... If you've got a decent automated test that consistently replicates this failure mode and you're able to share it, we might be able to use it to make improvements to partition handling in the future. If not code, a test script (and description of how you're disabling the network link) with precise steps would do.
From my observation it can be seen that RabbitMQ server will exhibit weird
behavior under stress condition when a slave goes down. I presume this weird
behavior will only surface under stress condition.
As I've stated before, clustering is not designed to handle network partitions. In previous versions of rabbit, when a partition occurred it was up to a system administrator to handle it. This usually involved stopping some (or all) of the nodes and restarting them. The 'autoheal' feature simply attempts to replicate what a sysadmin might do, and therefore (whilst cluster nodes are restarting) there can be delays. The intention is not to block the master (or producers), but depending on how the cluster partition is handled, that might be one of the potential side effects.
We will certainly take a look to see if this is avoidable - I can't comment on that right now - and being able to repeat the tests could help with that too.
Is there a rough estimate for the maximum load that a RabbitMQ server can
take per second for mirroring? Thanks.
I'm not sure whether mirroring is really the problem here. Mirrored queues can handle plenty of load, although there /is/ some performance slowdown due to replication (esp. if you're using publisher confirms), but there is no specific limit - this is hardware/resource and use-case dependent. I suspect the problem /you/ have could be a combination of network partition and cluster partition resolution (i.e., autoheal) mode.
Based on http://www.rabbitmq.com/partitions.html,
what I'd suggest is that you change 'autoheal' to 'ignore' and try your test again. That will either confirm or eliminate my suspicion quickly enough.
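For reference, here is a minimal sketch of that change. It assumes you are using the classic Erlang-term rabbitmq.config file (its location varies by platform and isn't stated in this thread) and that there are no other settings to preserve; if you already have a rabbit section, add the one tuple to it rather than replacing the file.

```erlang
%% rabbitmq.config (sketch; merge with any settings you already have)
[
  {rabbit, [
    %% Was 'autoheal' in your test. 'ignore' leaves a partitioned
    %% cluster alone rather than restarting nodes automatically.
    {cluster_partition_handling, ignore}
  ]}
].
```

Each node reads this file at startup, so the change would need to be made on, and followed by a restart of, every node in the cluster before re-running the test.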