Posting this to the list after some discussion on IRC with bob2351 on
irc.freenode.net.


We have a *slightly* strange situation with RabbitMQ: we start it
under `runit`, so it effectively believes that it's running in the
foreground. I have anecdotal evidence that this causes other problems, but
at least nothing that hurts too often (i.e. you lose "persistent
messages" in this setup).
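For reference, the kind of runit service definition described above might look roughly like this (a sketch only; the paths and node name are illustrative, not our exact config):

```shell
#!/bin/sh
# Hypothetical runit "run" script for RabbitMQ (illustrative paths/node name).
# runit supervises the child process directly, so the server must stay in
# the foreground: rabbitmq-server does that by default (it is the -detached
# flag that would daemonize it, which would break supervision).
exec 2>&1
export RABBITMQ_NODENAME=ourproject
exec /usr/sbin/rabbitmq-server
```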


That all aside, attached ( https://gist.github.com/leehambley/5773039 ) is
a stack trace from a problematic box. We couldn't get it to recover (single
node, single replica, etc.), so we simply deleted the Mnesia database,
which worked well enough.


Some information about our environment:


$ erl --version
Erlang R14B04 (erts-5.8.5) [source] [64-bit] [smp:8:8] [rq:8]
[async-threads:0] [kernel-poll:false]
$ dpkg --list | grep rabbit
ii rabbitmq-server 3.0.4-1 AMQP server written in Erlang
$ sudo RABBITMQ_NODENAME=ourproject rabbitmqctl status
Status of node ourproject at carla ...
[{pid,8055},
  {running_applications,
      [{rabbitmq_management,"RabbitMQ Management Console","3.0.4"},
       {rabbitmq_management_agent,"RabbitMQ Management Agent","3.0.4"},
       {rabbit,"RabbitMQ","3.0.4"},
       {os_mon,"CPO CXC 138 46","2.2.7"},
       {rabbitmq_web_dispatch,"RabbitMQ Web Dispatcher","3.0.4"},
       {webmachine,"webmachine","1.9.1-rmq3.0.4-git52e62bc"},
       {mochiweb,"MochiMedia Web Server","2.3.1-rmq3.0.4-gitd541e9a"},
       {xmerl,"XML parser","1.2.10"},
       {inets,"INETS CXC 138 49","5.7.1"},
       {mnesia,"MNESIA CXC 138 12","4.5"},
       {amqp_client,"RabbitMQ AMQP Client","3.0.4"},
       {sasl,"SASL CXC 138 11","2.1.10"},
       {stdlib,"ERTS CXC 138 10","1.17.5"},
       {kernel,"ERTS CXC 138 10","2.14.5"}]},
  {os,{unix,linux}},
  {erlang_version,
      "Erlang R14B04 (erts-5.8.5) [source] [64-bit] [smp:8:8] [rq:8]
[async-threads:30] [kernel-poll:true]\n"},
  {memory,
      [{total,33984216},
       {connection_procs,756760},
       {queue_procs,325576},
       {plugins,218728},
       {other_proc,9518440},
       {mnesia,93728},
       {mgmt_db,148472},
       {msg_index,71528},
       {other_ets,1145600},
       {binary,604208},
       {code,17266925},
       {atom,1550457},
       {other_system,2283794}]},
  {vm_memory_high_watermark,0.4},
  {vm_memory_limit,6656894566},
  {disk_free_limit,1000000000},
  {disk_free,11247643770880},
  {file_descriptors,
      [{total_limit,924},
       {total_used,23},
       {sockets_limit,829},
       {sockets_used,12}]},
  {processes,[{limit,1048576},{used,345}]},
  {run_queue,0},
  {uptime,2692}]
...done.




I believe this bug is already being tracked internally, and I post the
report here in the hope that I'll have a place to attach a snapshot of an
Mnesia database the next time this happens to us, or that someone else
might find this report and be able to contribute. Finally, selfishly, I
hope I'll get notified when this gets fixed, so I can upgrade and sleep
at night again.


- Lee Hambley


  • Simon MacMullen at Jun 13, 2013 at 12:49 pm
    Hi Lee. I would be interested to know how you got the machine into that
    state.


    There is a bug with a similar stack trace that will be fixed in the next
    release - but I don't think it's the same bug. In your case we are
    seeing a message which has been published and delivered according to the
    queue index, but only published (and not delivered) according to the
    queue index's journal. As the journal should always record the same
    state as, or a newer state than, the main index, this should be
    impossible.


    So to eliminate obvious causes of weirdness first: are you using an
    unusual filesystem, or mounting the filesystem with unusual options?


    Cheers, Simon

    On 13/06/13 12:36, Lee Hambley wrote:
    _______________________________________________
    rabbitmq-discuss mailing list
    rabbitmq-discuss at lists.rabbitmq.com
    https://lists.rabbitmq.com/cgi-bin/mailman/listinfo/rabbitmq-discuss



    --
    Simon MacMullen
    RabbitMQ, Pivotal
  • Simon MacMullen at Jun 13, 2013 at 2:08 pm
    Hmm, do you still have the Mnesia directory that wouldn't boot? Are you
    able to reproduce this?


    Cheers, Simon

    On 13/06/13 14:00, Lee Hambley wrote:
    Hi Simon,

    Nothing strange of that sort; we use runit to manage the process (in our
    env we need unprivileged users to be able to restart selected services,
    and with runit that's as simple as chowning a named pipe).

    In case it matters, on STOP runit sends TERM and waits 7s for the process
    to go away before resorting to sending KILL. (The follow-up KILL is our
    design, but in keeping with runit principles; the 7s timeout is internal
    to runit.)
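    For what it's worth, runit's `sv` can produce much the same TERM-then-KILL sequence on its own (the service path here is illustrative):

```shell
# Sketch: runit's built-in equivalent of the stop sequence described above.
# "sv force-stop" sends TERM, waits up to 7 seconds (sv's default timeout),
# and sends KILL if the process still hasn't exited.
sv force-stop /etc/sv/rabbitmq
```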

    We've no special filesystem configuration; these machines are i7s with
    RAID spinning disks (not sure what configuration, probably 2 drives).

    The hardware is practically new (<100h usage), and was burned in and
    stress-tested at install time.

    Happy to post fstabs, RAID logs, etc. if you tell me what you need (and,
    in weird cases, how to get it).

    --
    Lee Hambley
    --
    http://lee.hambley.name/
    +49 (0) 170 298 5667



    --
    Simon MacMullen
    RabbitMQ, Pivotal
  • Emile Joubert at Jun 19, 2013 at 12:12 pm
    Hi Lee,

    On 13/06/13 12:36, Lee Hambley wrote:
    That all aside, attached ( https://gist.github.com/leehambley/5773039 )
    is a stacktrace from a problematic box, we couldn't get it to recover
    (single node, single replica, etc, etc) - we simply deleted the mnesia
    database, which worked well enough.

    We've managed to reproduce this problem, but only by copying an older
    copy of an Mnesia directory over a newer one without clearing all the
    files first. Is it possible that the same happened on your side?
    The solution in that case is to clear the Mnesia directory before
    restoring. The integrity of the broker will be compromised if the Mnesia
    directory is modified, so avoid doing that.
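    In shell terms, the safe restore procedure described above might look like this (a sketch; the paths are illustrative and assume a Debian-style layout):

```shell
#!/bin/sh
# Sketch of restoring an Mnesia snapshot safely (illustrative paths).
# Stop the broker, clear the Mnesia directory completely, restore the
# snapshot, then start the broker again. Never copy an old snapshot over
# a newer directory in place.
MNESIA_DIR=/var/lib/rabbitmq/mnesia
sv stop /etc/sv/rabbitmq
rm -rf "${MNESIA_DIR:?}"/*
tar -xzf /backups/mnesia-snapshot.tar.gz -C "$MNESIA_DIR"
sv start /etc/sv/rabbitmq
```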




    -Emile
