I got some really funky errors and a sudden crash of my entire cluster.
Offhand, I'm GUESSING it's a disk error, but I'm not totally sure, so I
thought I'd see if anyone had any ideas.
Thanks!
Jason




The web ui returns the following:


{error,{error,badarg,
               [{ets,match,[rabbit_registry,{{exchange,'$1'},'$2'}],[]},
                {rabbit_registry,lookup_all,1,[]},
                {rabbit_mgmt_external_stats,list_registry_plugins,2,[]},
                {rabbit_mgmt_wm_overview,to_json,2,[]},
                {webmachine_resource,resource_call,3,[]},
                {webmachine_resource,do,3,[]},
                {webmachine_decision_core,resource_call,1,[]},
                {webmachine_decision_core,decision,1,[]}]}}




In the cluster.log file I see a TON of data and some really odd things like:


   {msg_status,198128,
                                <<8,75,158,13,139,133,182,107,227,74,148,135,78,
                                  181,221,89>>,
                                undefined,true,false,true,true,
                                {message_properties,undefined,true,false}},
                               {msg_status,198127,<<"???V??F??CT?
0k2????I?">>, undefined,true,false,true,true,
                                {message_properties,undefined,true,false}},
                               {msg_status,198126,
                                <<207,120,238,6,2,192,31,43,235,1,10,138,104,
                                  207,34,236>>,




=ERROR REPORT==== 14-Aug-2013::09:53:50 ===
** Generic server rabbit_mgmt_external_stats terminating
** Last message in was emit_update
** When Server state == {state,1024}
** Reason for termination ==
** {noproc,{gen_server,call,[rabbit_node_monitor,partitions,infinity]}}


** Reason for termination ==
** {{badmatch,{error,eio}},
     [{file_handle_cache,soft_close,1,[]},
      {file_handle_cache,hard_close,1,[]},
      {file_handle_cache,close,1,[]},
      {rabbit_msg_store,terminate,2,[]},
      {gen_server2,terminate,3,[]},
      {proc_lib,wake_up,3,[{file,"proc_lib.erl"},{line,237}]}]}
** In 'terminate' callback with reason ==
** shutdown








=ERROR REPORT==== 14-Aug-2013::09:53:50 ===
** Generic server msg_store_transient terminating
** Last message in was {'EXIT',<0.238.0>,shutdown}
** When Server state == {msstate,
                             "/data/rabbitmq/rabbitmq/mnesia/cluster/msg_store_transient",
                             rabbit_msg_store_ets_index,
                             {state,397381,
                                 "/data/rabbitmq/rabbitmq/mnesia/cluster/msg_store_transient"},
                             0,#Ref<0.0.0.1839>,
                             {dict,0,16,16,8,80,48,
                                 {[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],
                                  []},
                                 {{[],[],[],[],[],[],[],[],[],[],[],[],[],[],
                                   [],[]}}},
                             undefined,0,1271,[],<0.291.0>,401478,393277,
                             405575,409672,
                             {set,0,16,16,8,80,48,
                                 {[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],
                                  []},
                                 {{[],[],[],[],[],[],[],[],[],[],[],[],[],[],
                                   [],[]}}},
                             {dict,48,16,16,8,80,48,
                                 {[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],
                                  []},








--
Jason McIntosh
http://mcintosh.poetshome.com/blog/
573-424-7612

  • Michael Klishin at Aug 14, 2013 at 3:14 pm

    Jason McIntosh:


    {badmatch,{error,eio}}

    I'm far from being an expert on the storage subsystem, but this suggests
    an EIO error happened when the file handle cache tried to close a file.


    EIO is "a generic error code that commonly indicates I/O failure" [1] on Linux.


    1. https://www.usenix.org/legacy/event/fast08/tech/full_papers/gunawi/gunawi.pdf
    --
    MK
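
For readers unfamiliar with the code: EIO is a standard POSIX errno value, and most runtimes surface it the same way Erlang does here, as an error returned from a file operation. The Python sketch below only illustrates that idea; the helper name `describe_os_error` is made up for this example and has nothing to do with RabbitMQ's own code.

```python
import errno
import os


def describe_os_error(exc: OSError) -> str:
    """Return a short label for an OSError, singling out EIO.

    Hypothetical helper for illustration only.
    """
    if exc.errno is None:
        return "unknown: no errno attached"
    if exc.errno == errno.EIO:
        # EIO is the generic "input/output error" code; it usually points
        # at the disk, controller, or filesystem rather than the caller.
        return "EIO: low-level I/O failure"
    name = errno.errorcode.get(exc.errno, "UNKNOWN")
    return f"{name}: {os.strerror(exc.errno)}"


# Simulate the kind of error a failing disk might produce on close().
simulated = OSError(errno.EIO, os.strerror(errno.EIO))
print(describe_os_error(simulated))
```

A real EIO cannot be triggered on demand, which is why the example constructs the exception by hand.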


  • Simon MacMullen at Aug 14, 2013 at 3:17 pm

    On 14/08/13 16:06, Jason McIntosh wrote:
    Got some really funky errors and a sudden crash of my entire cluster.
    Off hand, I'm GUESSING it's a disk error, but I'm not totally sure -
    thought I'd see if anyone had any ideas?

    I think you're right.

    ** Reason for termination ==
    ** {{badmatch,{error,eio}},
    [{file_handle_cache,soft_close,1,[]},
    {file_handle_cache,hard_close,1,[]},
    {file_handle_cache,close,1,[]},
    {rabbit_msg_store,terminate,2,[]},
    {gen_server2,terminate,3,[]},
    {proc_lib,wake_up,3,[{file,"proc_lib.erl"},{line,237}]}]}



    eio is an I/O error. Reported by the file handle cache, that means you're
    seeing errors accessing the disk.
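
When the log is as large as the one described above, it can help to pull out only the crash reports whose reason mentions `{error,eio}`. The following is a rough Python sketch under stated assumptions: it assumes reports are delimited by `=ERROR REPORT====` header lines as in the excerpts in this thread, the function name `find_eio_reports` is hypothetical, and the sample text is made up for the demonstration (the server name `some_server` is invented, not taken from the real log).

```python
import re
from typing import List

# Matches the error tuple as it appears in the report, e.g. {{badmatch,{error,eio}},
EIO_REASON = re.compile(r"\{error,\s*eio\}")
SERVER_LINE = re.compile(r"\*\* Generic server (\S+) terminating")
REPORT_HEADER = re.compile(r"^=ERROR REPORT====.*$", flags=re.MULTILINE)


def find_eio_reports(log_text: str) -> List[str]:
    """Return the gen_server names of crash reports that mention {error,eio}."""
    hits = []
    # Split the log into individual reports at each header line.
    for report in REPORT_HEADER.split(log_text):
        if EIO_REASON.search(report):
            match = SERVER_LINE.search(report)
            hits.append(match.group(1) if match else "<unknown>")
    return hits


# Made-up sample in the same shape as the excerpts above.
sample = """=ERROR REPORT==== 14-Aug-2013::09:53:50 ===
** Generic server rabbit_mgmt_external_stats terminating
** Reason for termination ==
** {noproc,{gen_server,call,[rabbit_node_monitor,partitions,infinity]}}

=ERROR REPORT==== 14-Aug-2013::09:53:50 ===
** Generic server some_server terminating
** Reason for termination ==
** {{badmatch,{error,eio}},
"""

print(find_eio_reports(sample))
```

This only narrows down where to look; confirming a failing disk still means checking the operating system's own logs and the hardware.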


    But when you say "sudden crash of my entire cluster" - do you mean a
    disk failure on one node caused failures on other nodes? That would be
    bad! Or were your nodes sharing a disk somehow?


    Cheers, Simon


    --
    Simon MacMullen
    RabbitMQ, Pivotal

Posted to rabbitmq-discuss (rabbitmq.com, IRC #rabbitmq) on Aug 14, 2013 at 3:06 pm; last active at 3:17 pm. 3 posts, 3 users.