Hi folks,

While testing out ActiveMQ I've been building clusters in
VirtualBox, spinning up two 3-node Replicated LevelDB
stores on my laptop.

I've noticed that the clusters can sometimes get into a
state where none of the nodes is the master. It appears
to be an issue with talking to ZooKeeper.

I'm assuming the issue is related to how few CPU cycles
the clusters are getting in this environment, but the
fact that the clusters never recover makes me wonder
whether Replicated LevelDB is still a work in progress.

I was testing it because I saw that MasterSlave pairs
were deprecated in favor of either one of the shared-storage
solutions or Replicated LevelDB.

Typically I'll see a message from this code:

./activemq-leveldb-store/src/main/scala/org/apache/activemq/leveldb/replicated/groups/ChangeListener.scala:102:
        ChangeListenerSupport.LOG.warn("listeners are taking too long to process the events")

and then nothing: no more attempts to talk to the
ZooKeeper cluster, no attempts to elect a new master.

I haven't dug deeply into the issue yet; I wanted to
ask you folks about the status of the code first.

Jim


  • Tim Bain at Mar 4, 2015 at 1:53 pm
    People reported similar high-level symptoms against 5.10.0 several months
    back (you can search the archives on Nabble), and I don't recall any
    discussion of anyone finding a solution. But JIRA is the authoritative
    place to find out whether anyone has reported and/or fixed this issue (or
    any other).
  • James A. Robinson at Mar 4, 2015 at 8:31 pm
    Thanks. I'm pretty sure AMQ-5082 is what I'm seeing on 5.11.1.
    I'll see if I can get the cycles to set up a unit test to replicate the
    issue.
  • James A. Robinson at Mar 8, 2015 at 4:53 am

I think I've got the use case represented for

    https://issues.apache.org/jira/browse/AMQ-5082

    I could use some advice from others to confirm whether
    or not the underlying assumptions behind my test are valid.

My assumption is that if 3 ElectingLevelDBStore instances
are running, and they lose quorum due to a ZooKeeper session
timeout, then once the timeout issue is resolved the pool
ought to try to re-establish a quorum. Is that a fair assumption?
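    For what it's worth, the recovery I'm expecting can be sketched
    with a toy in-memory model (all names below are made up for
    illustration, not ActiveMQ's or ZooKeeper's actual API): group
    members are ephemeral sequential nodes, so when a session expires
    the member vanishes, and the group can only regain a master once
    the stores re-join, under brand-new node names.

    ```java
    import java.util.*;

    // Toy in-memory model of a ZooKeeper-style ephemeral group
    // (illustrative names only). Each join() behaves like an
    // EPHEMERAL_SEQUENTIAL create: the node gets a fresh, monotonically
    // increasing name, and expiring a session deletes every node that
    // session owned.
    final class EphemeralGroup {
        private long nextSeq = 0;
        private final Map<String, Long> members = new TreeMap<>(); // node name -> session id

        String join(long sessionId) {
            String node = String.format("%010d", nextSeq++);
            members.put(node, sessionId);
            return node;
        }

        void expireSession(long sessionId) {
            members.values().removeIf(owner -> owner == sessionId);
        }

        // Master = the member with the lowest sequence number, if any.
        Optional<String> master() {
            return members.keySet().stream().findFirst();
        }

        int size() {
            return members.size();
        }
    }
    ```

    In this model, once all three sessions expire, master() is empty,
    and nothing changes until each store calls join() again, which is
    why I'd expect the pool to actively re-register rather than merely
    reconnect.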

    https://github.com/jimrobinson/activemq/commit/58b7198880f5296af6b2e4e9efbbdfdb51220411

    Jim
  • James A. Robinson at Mar 10, 2015 at 11:58 pm
    Working my way through the code and the debug log from
    the test, I see that the ZooKeeper group is getting emptied
    out after session expiration occurs:

    before the timeout:

    2015-03-10 12:09:50,614 | DEBUG | ActiveMQ Task | ZooKeeper group for
    0000000001 changed: Map(foo ->
    ListBuffer((0000000000,{"id":"foo","container":null,"address":"tcp://localhost:62092","position":-1,"weight":1,"elected":"0000000000"}),
    (0000000001,{"id":"foo","container":null,"address":null,"position":-1,"weight":1,"elected":null}),
    (0000000002,{"id":"foo","container":null,"address":null,"position":-1,"weight":1,"elected":null})))

    after the timeout:

2015-03-10 12:10:53,490 | DEBUG | ZooKeeper state change dispatcher thread |
    ZooKeeper group for 0000000001 changed: Map()
  • James A. Robinson at Mar 11, 2015 at 8:53 pm
    So I think the problem is that

    org.linkedin.zookeeper.tracker.ZooKeeperTreeTracker

    doesn't appear to handle the event of a session disconnect.
    Or at least the version used by ActiveMQ doesn't...

If I force the tree to be rebuilt on a reconnect, my earlier
unit test passes:

    https://github.com/jimrobinson/activemq/commit/d272a116ff5c0916a6044d657f99df48f264bd2a
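    The change boils down to a pattern like the following (a minimal
    sketch with made-up names, not the actual ZooKeeperTreeTracker API):
    when the client reports a reconnect, throw away the cached tree and
    re-read it from the server instead of trusting the pre-expiry contents.

    ```java
    import java.util.*;
    import java.util.function.Supplier;

    // Minimal sketch of the "rebuild on reconnect" fix (names are made
    // up). readTree stands in for re-listing the group's znodes from
    // the server.
    final class RebuildingTracker {
        private final Supplier<Map<String, String>> readTree;
        private Map<String, String> cache;

        RebuildingTracker(Supplier<Map<String, String>> readTree) {
            this.readTree = readTree;
            rebuild();
        }

        // Without this hook the cache keeps the pre-expiry nodes
        // forever, which is the stuck state the unit test reproduces.
        void onReconnect() {
            rebuild();
        }

        private void rebuild() {
            cache = new TreeMap<>(readTree.get());
        }

        Map<String, String> members() {
            return Collections.unmodifiableMap(cache);
        }
    }
    ```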
  • Gary Tully at Mar 11, 2015 at 10:28 pm
    I think you are correct here. The rebuild should work so long as the
    session has not expired.
  • James A. Robinson at Mar 11, 2015 at 10:56 pm

    On Wed, Mar 11, 2015 at 3:28 PM, Gary Tully wrote:
    I think you are correct here. The rebuild should work so long as the
    session has not expired.
The nodes in the ZooKeeper group tree are ephemeral, so they
disappear once the session is lost. I think the underlying client
manages to re-establish a session, but by that point the nodes
are gone and have to be re-created (meaning the sequence numbers
are incremented, so the eid is effectively renamed).
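    One consequence of that renaming, sketched below with a toy
    one-shot watch registry (illustrative only, not ZooKeeper's real
    classes): a watch left on a member's old path never fires for the
    re-created node, since the new node has a different name, so a
    tracker has to watch the parent for child changes and re-list,
    rather than waiting on the old eids.

    ```java
    import java.util.*;

    // Toy one-shot watch registry (illustrative only). Creating a node
    // fires a watch set on that exact path, plus a child watch on its
    // parent -- roughly mirroring ZooKeeper's exists()/getChildren()
    // watch split.
    final class WatchRegistry {
        private final Set<String> watches = new HashSet<>();
        private final List<String> fired = new ArrayList<>();

        void watch(String path) {
            watches.add(path);
        }

        void create(String path) {
            if (watches.remove(path)) fired.add(path);
            int i = path.lastIndexOf('/');
            if (i > 0) {
                String parent = path.substring(0, i);
                if (watches.remove(parent)) fired.add(parent);
            }
        }

        List<String> fired() {
            return fired;
        }
    }
    ```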

What I was seeing was that the tree was still populated with the
old nodes. Forcing the rebuild appears to let the rest of the code
get everything back into the right state, and master elections can
resume.

I was hoping the original author might chime in on the problem,
but unfortunately I didn't get any response when I pinged him
by email. I haven't done much programming against ZooKeeper,
and even less in Scala, so I'm not sure whether there are
cleaner, more correct approaches to fixing the problem.

    In any case, I've updated AMQ-5082 with my comments and
    with the pointers to the proposed code.

    Jim

Discussion Overview
group: users
category: activemq
posted: Mar 3, 2015 at 3:23 PM
active: Mar 11, 2015 at 10:56 PM
posts: 8
users: 4
website: activemq.apache.org
