We run a small 5-node (1 disk) cluster with version 2.6.1 and Erlang R14B04.

Whenever we send a large message, ~100MB, the clustered nodes lose
all remote queues and bindings.
The queues and bindings remain only on the node where they were created.

Has anybody else experienced the sudden drop of queues and bindings on
clustered nodes?

- Irmo


  • Simon MacMullen at Oct 25, 2011 at 10:21 am

    On 25/10/11 11:05, Irmo Manie wrote:
    Whenever we send a large message, ~100MB, the clustered nodes lose
    all remote queues and bindings.
    The queues and bindings remain only on the node where they were created.

    Has anybody else experienced the sudden drop of queues and bindings on
    clustered nodes?
    No, really not. That sounds almost like the cluster is getting
    partitioned, which really should not happen. What, if any, error
    messages are you seeing in the logs?

    Cheers, Simon

    --
    Simon MacMullen
    RabbitMQ, VMware
  • Simon MacMullen at Oct 25, 2011 at 10:36 am

    On 25/10/11 11:21, Simon MacMullen wrote:
    That sounds almost like the cluster is getting partitioned
    Matthias points out that this can indeed happen, if transmitting one
    message across the cluster takes longer than the configured ticktime
    (see net_ticktime at http://www.erlang.org/doc/man/kernel_app.html). But
    by default that's 60s, and we don't change it. Is your cluster that slow?
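
    For reference, net_ticktime is a setting of the Erlang kernel
    application, so it can be raised via a config file passed to the VM
    (the 120s value below is purely illustrative, not a recommendation
    from this thread):

    ```erlang
    %% sys.config (loaded via "erl -config sys"): raise the kernel
    %% net_ticktime from its 60s default to 120s, giving slow links
    %% more headroom before a peer node is declared down.
    [
      {kernel, [
        {net_ticktime, 120}
      ]}
    ].
    ```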

    Cheers, Simon

    --
    Simon MacMullen
    RabbitMQ, VMware
  • Irmo Manie at Oct 25, 2011 at 10:54 am
    It does take more than 60 seconds, yes.
    For testing we use some remote virtual machines which are a bit
    sluggish (read: horribly slow) on the I/O.
    But the nodes already get partitioned in about 15-20 seconds, which
    doesn't really match the default earliest possibility of 45 seconds.

    Next to this, is there any way that the cluster will try to
    rejoin after such an incident? Any reason why it won't rejoin
    after the message is finally delivered?

    - Irmo
    On Tue, Oct 25, 2011 at 12:36 PM, Simon MacMullen wrote:
    On 25/10/11 11:21, Simon MacMullen wrote:

    That sounds almost like the cluster is getting partitioned.
    Matthias points out that this can indeed happen, if transmitting one message
    across the cluster takes longer than the configured ticktime (see
    net_ticktime at http://www.erlang.org/doc/man/kernel_app.html). But by
    default that's 60s, and we don't change it. Is your cluster that slow?

    Cheers, Simon

    --
    Simon MacMullen
    RabbitMQ, VMware
  • Matthias Radestock at Oct 27, 2011 at 7:07 am
    Irmo,
    On 25/10/11 11:54, Irmo Manie wrote:
    It does take more than 60 seconds yes.
    For testing we use some remote virtual machines which are a bit
    sluggish (read: horribly slow) on the I/O.
    But the nodes already get partitioned in about 15-20 seconds which
    doesn't really match the default earliest possibility of 45 seconds.
    Depends on when the last tick was sent, e.g. it may have been sent 30
    seconds before the large message is sent.
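
    A rough sketch of this timing argument, assuming ticks are exchanged
    every net_ticktime/4 (15s by default) and a peer is declared down
    after net_ticktime (60s) of silence (numbers are illustrative):

    ```python
    # Rough model of the Erlang net-tick timeout.
    NET_TICKTIME = 60                   # default, seconds
    TICK_INTERVAL = NET_TICKTIME / 4    # a tick is sent roughly every 15s

    def seconds_until_partition(secs_since_last_tick):
        """How long after a big transfer starts blocking the link the
        peer is declared down, given how long before the blockage the
        last tick got through."""
        return NET_TICKTIME - secs_since_last_tick

    # If the last tick landed 45s before the 100MB message started
    # hogging the connection, the peer is declared down after only
    # 15 more seconds of blockage -- matching the 15-20s observed.
    print(seconds_until_partition(45))  # 15
    ```
    
    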

    Btw,
    http://erlang.2086793.n4.nabble.com/node-to-node-message-passing-td2536251.html
    describes the issue.
    Next to this, is there any way that the cluster will try to
    rejoin after such an incident?
    It should join up again when one node attempts to talk to the other. Not
    sure what would trigger that in rabbit; queue/exchange/binding
    creation/deletion perhaps (due to the mnesia synchronisation that needs
    to happen as part of that).

    Regards,

    Matthias.
  • Irmo Manie at Oct 27, 2011 at 7:59 am
    Matthias,

    On Thu, Oct 27, 2011 at 9:07 AM, Matthias Radestock
    wrote:
    Irmo,
    Depends on when the last tick was sent, e.g. it may have been sent 30
    seconds before the large message is sent.

    Btw,
    http://erlang.2086793.n4.nabble.com/node-to-node-message-passing-td2536251.html
    describes the issue.
    Ok thanks, that clarifies the behaviour then. I'll prepare our internal
    communication tooling to automatically split big messages into multiple
    parts to prevent all this.
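
    The splitting described above can be sketched as plain payload
    chunking; the 1MB chunk size and the helper names here are made up
    for illustration, not part of any RabbitMQ API:

    ```python
    # Hypothetical chunking helpers: split a large payload into parts
    # small enough that no single transfer blocks the inter-node link
    # for longer than the tick timeout, then reassemble on the consumer.
    CHUNK_SIZE = 1 * 1024 * 1024  # 1MB per part (arbitrary choice)

    def split_message(payload: bytes, chunk_size: int = CHUNK_SIZE):
        """Yield (part_index, total_parts, chunk) triples for publishing."""
        total = (len(payload) + chunk_size - 1) // chunk_size
        for i in range(total):
            yield i, total, payload[i * chunk_size:(i + 1) * chunk_size]

    def reassemble(parts):
        """Rebuild the original payload from (index, total, chunk) triples."""
        ordered = sorted(parts, key=lambda p: p[0])
        assert len(ordered) == ordered[0][1], "missing parts"
        return b"".join(chunk for _, _, chunk in ordered)

    payload = b"x" * (3 * CHUNK_SIZE + 123)
    parts = list(split_message(payload))
    assert reassemble(parts) == payload
    print(len(parts))  # 4
    ```
    
    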
    It should join up again when one node attempts to talk to the other. Not
    sure what would trigger that in rabbit; queue/exchange/binding
    creation/deletion perhaps (due to the mnesia synchronisation that needs to
    happen as part of that).
    I haven't tested or tried what triggers a full rejoin to the cluster,
    including getting all queue and binding info again.
    I actually expected this to happen automatically, and I think it should, right?

    Cheers,
    Irmo
  • Matthias Radestock at Oct 27, 2011 at 8:15 am
    Irmo,
    On 27/10/11 08:59, Irmo Manie wrote:
    I haven't tested or tried what triggers a full rejoin to the cluster,
    including getting all queue and binding info again.
    That just results in local reads. Try creating or deleting an
    exchange/binding/queue.
    I actually expected this to happen automatically and I think it
    should, right?
    Erlang/OTP doesn't do this for us. So it would have to be done in rabbit
    code.


    Matthias.
