Hello,


We had a problem today caused by a wiped rabbit_serial file on one of the nodes. Our RabbitMQ cluster, which consists of two nodes, became inaccessible.
There is a ticket: https://github.com/rabbitmq/rabbitmq-server/issues/17


I think RabbitMQ should be able to recover from this situation, perhaps by removing the invalid rabbit_serial file. What do you think?


Regards.

  • Matthias Radestock at May 29, 2013 at 7:49 am

    On 28/05/13 23:42, Denisenko, Mikhail (NIH/NLM/NCBI) [C] wrote:
    > We had a problem today caused by a wiped rabbit_serial file on
    > one of the nodes. Our RabbitMQ cluster, which consists of two nodes,
    > became inaccessible.
    >
    > There is a ticket: https://github.com/rabbitmq/rabbitmq-server/issues/17
    >
    > I think RabbitMQ should be able to recover from this situation,
    > perhaps by removing the invalid rabbit_serial file. What do you think?

    As Emile mentioned in the ticket, rabbit cannot recover from arbitrary
    filesystem corruption.


    The specific issue of rabbit_serial being empty actually has come up
    once before - see
    http://rabbitmq.1065348.n5.nabble.com/Cannot-start-guid-generator-rabbitmq-td4415.html.
    So I have filed a feature request to investigate whether we should
    handle this particular case of corruption more gracefully.


    I recommend you look into what may have caused the corruption. While
    there is an easy workaround in this particular instance, the same isn't
    true for other files, and corruptions there could easily lead to data
    loss, possibly undetected.




    Regards,


    Matthias.
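
The workaround itself is not spelled out in the thread. A minimal sketch, assuming it amounts to what the original post suggests (getting the empty rabbit_serial file out of the way while the node is stopped, and assuming the node recreates it on the next start, which the thread implies but does not state). The function name, the backup suffix, and the example path are hypothetical; the file is renamed rather than deleted so nothing is lost if the assumption is wrong.

```python
import time
from pathlib import Path


def quarantine_if_empty(serial_file: Path) -> bool:
    """Rename a zero-byte file out of the way; return True if it was moved."""
    # Only act on an existing regular file that is exactly 0 bytes long.
    if not serial_file.is_file() or serial_file.stat().st_size > 0:
        return False
    # Keep the evidence: rename instead of deleting.
    backup = serial_file.with_name(f"{serial_file.name}.empty-{int(time.time())}")
    serial_file.rename(backup)
    return True


if __name__ == "__main__":
    # Hypothetical location; adjust to the node's actual data directory.
    path = Path("/var/lib/rabbitmq/mnesia/rabbit@myhost/rabbit_serial")
    print("moved aside" if quarantine_if_empty(path) else "left untouched")
```

This only covers the zero-byte case discussed here. As noted above, corruption in other files has no comparably simple fix.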
  • Denisenko, Mikhail (NIH/NLM/NCBI) [C] at May 29, 2013 at 2:43 pm
    Thanks for the feature request. In our case the file became corrupted after a restart of RabbitMQ. Our restart procedure is to try a graceful restart first and, if the process has not exited after some time, to fall back to kill -9. I suspect that in this case it did not die within the timeout, was killed with -9, and perhaps did not flush its buffer for this file.
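
The restart procedure described above (graceful stop first, kill -9 after a timeout) can be sketched roughly as follows. This is an illustration for a generic child process on a POSIX system, not the poster's actual script; the function name and the grace period are assumptions.

```python
import subprocess
import sys


def stop_with_escalation(proc: subprocess.Popen, grace_seconds: float) -> str:
    """Ask proc to exit; force-kill it only if the grace period runs out."""
    proc.terminate()  # sends SIGTERM on POSIX, which the process may handle
    try:
        proc.wait(timeout=grace_seconds)
        return "graceful"
    except subprocess.TimeoutExpired:
        proc.kill()  # sends SIGKILL on POSIX, which cannot be caught
        proc.wait()  # reap the process so it does not linger as a zombie
        return "killed"


if __name__ == "__main__":
    # Stand-in for a long-running server: a child that just sleeps.
    child = subprocess.Popen([sys.executable, "-c", "import time; time.sleep(60)"])
    print(stop_with_escalation(child, grace_seconds=5.0))
```

A process that receives SIGKILL gets no chance to run shutdown code, so anything still sitting in its own userspace buffers at that moment is lost.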
    ________________________________________
    From: Matthias Radestock [matthias at rabbitmq.com]
    Sent: Wednesday, May 29, 2013 3:49 AM
    To: Discussions about RabbitMQ
    Cc: Denisenko, Mikhail (NIH/NLM/NCBI) [C]
    Subject: Re: [rabbitmq-discuss] empty rabbit_serial file causes rabbitmq cluster to hang

  • Matthias Radestock at May 29, 2013 at 3:04 pm

    On 29/05/13 15:43, Denisenko, Mikhail (NIH/NLM/NCBI) [C] wrote:
    > Thanks for the feature request. In our case the file became
    > corrupted after a restart of RabbitMQ. Our restart procedure is to
    > try a graceful restart first and, if the process has not exited
    > after some time, to fall back to kill -9. I suspect that in this
    > case it did not die within the timeout, was killed with -9, and
    > perhaps did not flush its buffer for this file.

    Hmm. Doesn't quite add up. The file gets written during startup, not
    shutdown. And the time window should be very short, since the file is
    tiny. And we call 'sync' after the writing.


    Matthias.
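
The write-then-sync pattern mentioned in the last reply can be illustrated as below. This is a Python sketch of the general technique, not RabbitMQ's Erlang code; the function name and the on-disk format (a bare integer and a newline) are assumptions made for the example.

```python
import os
from pathlib import Path


def write_serial_durably(path: Path, serial: int) -> None:
    """Write a small value and force it to stable storage before returning."""
    # Opening with "w" truncates first, so the file is briefly empty
    # until the write below lands. This is the short window in question.
    with open(path, "w", encoding="ascii") as f:
        f.write(f"{serial}\n")
        f.flush()             # push Python's userspace buffer to the OS
        os.fsync(f.fileno())  # ask the OS to push its cache to the disk
```

With this pattern, a process killed after the function returns should not leave an empty file behind, which is why the kill -9 explanation does not quite fit a file that is written at startup and synced right away.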

Discussion Overview
group: rabbitmq-discuss
categories: rabbitmq
posted: May 28, '13 at 10:42p
active: May 29, '13 at 3:04p
posts: 4
users: 2
website: rabbitmq.com
irc: #rabbitmq
