We could probably do with better error reporting / handling in this
case, but I think this is what happened (certainly I did this and saw
the same errors)...

* Build a cluster with node1 = disc and node2 = ram
* Stop both nodes
* Start node2 *first*

At this point node2 does not have any disc nodes to catch up from... so
it initialises a fresh copy of the mnesia database in RAM and starts up
by itself.

* Start node1

At this point node1 knows that it needs to cluster with node2, but node2
doesn't agree, and has a different version of the database anyway. Hence
the error message.
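For anyone wanting to reproduce this, the sequence above looks roughly like the following at the shell (hypothetical hostnames node1/node2; this uses the old `rabbitmqctl cluster` syntax current in early 2012, where omitting the local node from the node list makes it a RAM node):

```shell
# Initial setup -- on node2, join node1's cluster as a RAM node.
# (node2 is absent from the node list, so it joins as a RAM node.)
rabbitmqctl stop_app
rabbitmqctl reset
rabbitmqctl cluster rabbit@node1
rabbitmqctl start_app

# Reproduce the failure: stop both nodes...
rabbitmqctl stop           # run on node1 and on node2

# ...then start the RAM node *before* the disc node.
rabbitmq-server -detached  # run on node2 first, then node1
```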

At this point I was able to recover by stopping node2 again, and
starting node1 first, then node2.
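In shell terms the recovery was roughly this (hypothetical hostnames; run each command on the node indicated in the comment):

```shell
# Stop the RAM node again.
rabbitmqctl stop               # on node2

# Start the disc node first and confirm it is up before proceeding.
rabbitmq-server -detached      # on node1
rabbitmqctl status             # on node1; retry until it reports running

# Only then start the RAM node.
rabbitmq-server -detached      # on node2
```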

I'll file a bug to make the errors clearer in this case, but for the
time being you should make sure to always bring at least one disc node
up first in any cluster.
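One way to enforce that ordering is a small wrapper on the RAM node that polls the disc node before starting locally -- a sketch, assuming the disc node is named rabbit@node1:

```shell
#!/bin/sh
# Wait for the disc node (assumed name: rabbit@node1) to respond
# before starting the local RAM node.
until rabbitmqctl -n rabbit@node1 status >/dev/null 2>&1; do
    echo "waiting for disc node rabbit@node1..."
    sleep 5
done
rabbitmq-server -detached
```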

In your case you should consider switching to a cluster with two disc
nodes anyway - having only one disc node is a single point of failure
(SPOF).
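With the old (pre-3.0) clustering syntax, a RAM node can be converted to a disc node by re-issuing the cluster command with its own name included in the node list -- a sketch with hypothetical node names:

```shell
# On node2: re-cluster, this time listing node2 itself in the node
# list, which makes it a disc node rather than a RAM node.
rabbitmqctl stop_app
rabbitmqctl cluster rabbit@node1 rabbit@node2
rabbitmqctl start_app
```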

Cheers, Simon
On 24/01/12 09:32, LuCo wrote:
Hello.

We have a RabbitMQ cluster across 2 machines. Machine 1 is created as
a disc node, and Machine 2 as a memory node.

These machines are restarted every weekend. This has not been a
problem over the last 5 weeks (since we began trialling RabbitMQ),
during which time the nodes have been little used. However, this past
weekend the disc node logged the error below just after the machine on
which it runs was restarted:

"
=ERROR REPORT==== 22-Jan-2012::01:17:37 ===
Mnesia('rabbit@MACHINE1'): ** ERROR ** (core dumped to file: "c:/
Documents and Settings/user/Application Data/RabbitMQ/
MnesiaCore.rabbit@MACHINE1_1327_195058_860060")
** FATAL ** Failed to merge schema: Bad cookie in table definition
mirrored_sup_childspec: 'rabbit@MACHINE1' =
{cstruct,mirrored_sup_childspec,ordered_set,
['rabbit@MACHINE2','rabbit@MACHINE1'],[],[],0,read_write,false,[],
[],false,mirrored_sup_childspec,[key,mirroring_pid,childspec],[],[],
{{1324,33984,878002},'rabbit@MACHINE1'},{{5,0},{'rabbit@MACHINE2',
{1324,34466,491790}}}}, 'rabbit@MACHINE2' =
{cstruct,mirrored_sup_childspec,ordered_set,['rabbit@MACHINE2'],[],[],
0,read_write,false,[],[],false,mirrored_sup_childspec,
[key,mirroring_pid,childspec],[],[],
{{1327,194914,615072},'rabbit@MACHINE2'},{{2,0},[]}}
=ERROR REPORT==== 22-Jan-2012::01:17:44 ===
** Generic server mnesia_subscr terminating
** Last message in was {'EXIT',<0.51.0>,killed}
** When Server state == {state,<0.51.0>,57361}
** Reason for termination ==
** killed
=ERROR REPORT==== 22-Jan-2012::01:17:44 ===
** Generic server mnesia_monitor terminating
** Last message in was {'EXIT',<0.51.0>,killed}
** When Server state == {state,<0.51.0>,[],[],true,[],undefined,[]}
** Reason for termination ==
** killed
=ERROR REPORT==== 22-Jan-2012::01:17:44 ===
** Generic server mnesia_recover terminating
** Last message in was {'EXIT',<0.51.0>,killed}
** When Server state == {state,<0.51.0>,undefined,undefined,undefined,
0,false,
true,[]}
** Reason for termination ==
** killed
=ERROR REPORT==== 22-Jan-2012::01:17:44 ===
** Generic server mnesia_snmp_sup terminating
** Last message in was {'EXIT',<0.51.0>,killed}
** When Server state == {state,
{local,mnesia_snmp_sup},
simple_one_for_one,
[{child,undefined,mnesia_snmp_sup,
{mnesia_snmp_hook,start,[]},
transient,3000,worker,
[mnesia_snmp_sup,mnesia_snmp_hook,
supervisor]}],
undefined,0,86400000,[],mnesia_snmp_sup,
[]}
** Reason for termination ==
** killed
=INFO REPORT==== 22-Jan-2012::01:17:44 ===
application: mnesia
exited: {shutdown,{mnesia_sup,start,[normal,[]]}}
type: permanent
"

The memory node then had this error in the log just after the machine
on which the node runs was restarted:
"
=INFO REPORT==== 22-Jan-2012::01:12:08 ===
node 'rabbit@G1SVR2-IIS' lost 'rabbit'
=INFO REPORT==== 22-Jan-2012::01:12:08 ===
Statistics database started.
=INFO REPORT==== 22-Jan-2012::01:15:13 ===
Limiting to approx 924 file handles (829 sockets)
=INFO REPORT==== 22-Jan-2012::01:15:14 ===
application: mnesia
exited: stopped
type: permanent
=INFO REPORT==== 22-Jan-2012::01:15:14 ===
Memory limit set to 818MB of 2047MB total.
...<log continues with default initialisation of the node>...
"

As mentioned, the nodes have been seldom used and contained only 2
durable queues. This crash resulted in the nodes resuming their
default configuration (lost previously configured users).

The clocks on the two machines are in sync, so I am a little confused
by the messages on MACHINE 2 (the memory node), which seems to have
crashed before MACHINE 1?

I can understand why this has possibly happened (one node up/one node
down when attempting to cluster on restart), but why did it not happen
on the previous 5 restarts? What actually happens on a Rabbit restart
(following a server restart) in a cluster scenario? Do I need a custom
start-up script to cover all bases?

Any thoughts?

Daniel

_______________________________________________
rabbitmq-discuss mailing list
rabbitmq-discuss@lists.rabbitmq.com
https://lists.rabbitmq.com/cgi-bin/mailman/listinfo/rabbitmq-discuss

--
Simon MacMullen
RabbitMQ, VMware
