FAQ
Hi, we've set up HDFS HA. We have two namenodes.
One of these nodes is in a "bad" state. Unfortunately, I have no idea why.
Please help me understand the root cause of the failure.
Here is the state of the roles:

<https://lh5.googleusercontent.com/-k_gDO9apZ6U/UXefneKMOmI/AAAAAAAAA9Y/Pzu0vxDgT5I/s1600/01_failoveR_controller.png>

Here is stderr.log:

+ '[' namenode = zkfc -o secondarynamenode = zkfc -o datanode = zkfc ']'
+ exec /opt/cloudera/parcels/CDH-4.2.0-1.cdh4.2.0.p0.10/lib/hadoop-hdfs/bin/hdfs --config /var/run/cloudera-scm-agent/process/1133-hdfs-FAILOVERCONTROLLER zkfc
Exception in thread "main" java.lang.RuntimeException: ZK Failover Controller failed: Received create error from Zookeeper. code:CONNECTIONLOSS for path /hadoop-ha/nameservice1/ActiveStandbyElectorLock. Not retrying further znode create connection errors.
  at org.apache.hadoop.ha.ZKFailoverController.mainLoop(ZKFailoverController.java:359)
  at org.apache.hadoop.ha.ZKFailoverController.doRun(ZKFailoverController.java:231)
  at org.apache.hadoop.ha.ZKFailoverController.access$000(ZKFailoverController.java:58)
  at org.apache.hadoop.ha.ZKFailoverController$1.run(ZKFailoverController.java:165)
  at org.apache.hadoop.ha.ZKFailoverController$1.run(ZKFailoverController.java:161)
  at org.apache.hadoop.security.SecurityUtil.doAsLoginUserOrFatal(SecurityUtil.java:452)
  at org.apache.hadoop.ha.ZKFailoverController.run(ZKFailoverController.java:161)
  at org.apache.hadoop.hdfs.tools.DFSZKFailoverController.main(DFSZKFailoverController.java:175)


Here is sysout.log with error:

2013-04-24 03:31:02,851 FATAL org.apache.hadoop.ha.ActiveStandbyElector: Received create error from Zookeeper. code:CONNECTIONLOSS for path /hadoop-ha/nameservice1/ActiveStandbyElectorLock. Not retrying further znode create connection errors.
2013-04-24 03:31:03,509 INFO org.apache.zookeeper.ZooKeeper: Session: 0xc3dd5e58304003c closed
2013-04-24 03:31:03,509 FATAL org.apache.hadoop.ha.ZKFailoverController: Fatal error occurred:Received create error from Zookeeper. code:CONNECTIONLOSS for path /hadoop-ha/nameservice1/ActiveStandbyElectorLock. Not retrying further znode create connection errors.
2013-04-24 03:31:03,510 INFO org.apache.hadoop.ipc.Server: Stopping server on 8019
2013-04-24 03:31:03,510 WARN org.apache.hadoop.ha.ActiveStandbyElector: Ignoring stale result from old client with sessionId 0xc3dd5e58304003c
2013-04-24 03:31:03,510 WARN org.apache.hadoop.ha.ActiveStandbyElector: Ignoring stale result from old client with sessionId 0xc3dd5e58304003c
2013-04-24 03:31:03,510 WARN org.apache.hadoop.ha.ActiveStandbyElector: Ignoring stale result from old client with sessionId 0xc3dd5e58304003c
2013-04-24 03:31:03,510 INFO org.apache.hadoop.ha.ActiveStandbyElector: Yielding from election
2013-04-24 03:31:03,510 INFO org.apache.hadoop.ipc.Server: Stopping IPC Server Responder
2013-04-24 03:31:03,510 INFO org.apache.hadoop.ha.HealthMonitor: Stopping HealthMonitor thread
2013-04-24 03:31:03,510 INFO org.apache.hadoop.ipc.Server: Stopping IPC Server listener on 8019
2013-04-24 03:31:03,510 WARN org.apache.hadoop.ha.ActiveStandbyElector: Ignoring stale result from old client with sessionId 0xc3dd5e58304003c
2013-04-24 03:31:03,511 WARN org.apache.hadoop.ha.ActiveStandbyElector: Ignoring stale result from old client with sessionId 0xc3dd5e58304003c
2013-04-24 03:31:03,511 WARN org.apache.hadoop.ha.ActiveStandbyElector: Ignoring stale result from old client with sessionId 0xc3dd5e58304003c
2013-04-24 03:31:03,511 WARN org.apache.hadoop.ha.ActiveStandbyElector: Ignoring stale result from old client with sessionId 0xc3dd5e58304003c
2013-04-24 03:31:03,511 WARN org.apache.hadoop.ha.ActiveStandbyElector: Ignoring stale result from old client with sessionId 0xc3dd5e58304003c
2013-04-24 03:31:03,511 INFO org.apache.zookeeper.ClientCnxn: EventThread shut down


What are we doing wrong?

  • bc Wong at Apr 24, 2013 at 6:37 pm
    How many ZK servers do you have? (Please make sure that you have 3, or 5.)

    This is a connection problem between the failover controller and ZK. You
    may find more info from the ZK log. I'd suggest that you restart the bad
    failover controller and see if that fixes it.
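
    For example, a quick health check from the failover controller host could look
    like the sketch below (the default client port 2181 and the zk1/zk2/zk3
    hostnames are placeholders for your actual ZK servers):

    # Ask each ZK server whether it is up and what role it plays in the quorum
    for h in zk1 zk2 zk3; do
      echo "== $h =="
      echo ruok | nc "$h" 2181; echo
      echo stat | nc "$h" 2181 | grep -E 'Mode|Latency|Connections'
    done

    # Then restart the bad Failover Controller role, e.g. from Cloudera Manager
    # (HDFS service > Instances > Failover Controller > Restart). On a plain
    # package install an init script like this may exist instead:
    # sudo service hadoop-hdfs-zkfc restart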

    Cheers,
    bc



  • Serega Sheypak at Apr 24, 2013 at 7:48 pm
    There are 19 ZooKeepers.
    I restarted it yesterday. There have been two hard resets over the past two days.




  • bc Wong at Apr 24, 2013 at 8:24 pm
    That's your problem. 19 is way way way too many, and you're getting hit by
    intra-ZK sync. For a normal cluster, use 3. Or 5 if you really need the
    redundancy. The largest ZK quorum that I've heard of is 9, supporting
    thousands of client machines.
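
    One quick way to confirm how many servers are actually configured in the
    ensemble is to count the server.N entries in zoo.cfg on any ZK host (a sketch;
    /etc/zookeeper/conf/zoo.cfg is the usual package path and may differ on a
    CM-managed cluster):

    grep -c '^server\.' /etc/zookeeper/conf/zoo.cfg   # how many quorum members
    grep '^server\.' /etc/zookeeper/conf/zoo.cfg      # which hosts they are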




  • Serega Sheypak at Apr 24, 2013 at 8:41 pm
    We also run Storm on the same machines (~20 Storm nodes). It also needs
    ZooKeeper.
    We had a perception that there were not enough nodes for Storm; that is why
    we added more.
    OK, so we'll try to dig deeper and find out what's wrong with Storm's units of
    parallelism.

    Thank you.




  • bc Wong at Apr 24, 2013 at 8:45 pm
    It's fine for Storm to use 20 nodes, but ZK shouldn't. Having 19 ZK nodes
    actually reduces your ZK write performance, since every write has to be
    acknowledged by a majority of the ensemble. See the ZK paper:
    http://static.usenix.org/events/usenix10/tech/full_papers/Hunt.pdf (table 1).
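
    For reference, ZooKeeper 3.4+ exposes its own latency counters, so you can
    compare the quorum's commit latency before and after shrinking it (a sketch;
    the hostnames are placeholders):

    for h in zk1 zk2 zk3; do
      echo "== $h =="
      echo mntr | nc "$h" 2181 | grep -E 'zk_avg_latency|zk_max_latency|zk_outstanding_requests'
    done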



  • Serega Sheypak at May 6, 2013 at 7:28 am
    I've reduced the number of ZooKeepers to 5 instances.
    One of my NameNodes is in a failed state again.
    Here is part of the stderr log:

    + exec /opt/cloudera/parcels/CDH-4.2.0-1.cdh4.2.0.p0.10/lib/hadoop-hdfs/bin/hdfs --config /var/run/cloudera-scm-agent/process/1397-hdfs-FAILOVERCONTROLLER zkfc
    Exception in thread "main" java.lang.RuntimeException: ZK Failover Controller failed: Received create error from Zookeeper. code:CONNECTIONLOSS for path /hadoop-ha/nameservice1/ActiveStandbyElectorLock. Not retrying further znode create connection errors.
      at org.apache.hadoop.ha.ZKFailoverController.mainLoop(ZKFailoverController.java:359)
      at org.apache.hadoop.ha.ZKFailoverController.doRun(ZKFailoverController.java:231)
      at org.apache.hadoop.ha.ZKFailoverController.access$000(ZKFailoverController.java:58)
      at org.apache.hadoop.ha.ZKFailoverController$1.run(ZKFailoverController.java:165)
      at org.apache.hadoop.ha.ZKFailoverController$1.run(ZKFailoverController.java:161)
      at org.apache.hadoop.security.SecurityUtil.doAsLoginUserOrFatal(SecurityUtil.java:452)
      at org.apache.hadoop.ha.ZKFailoverController.run(ZKFailoverController.java:161)
      at org.apache.hadoop.hdfs.tools.DFSZKFailoverController.main(DFSZKFailoverController.java:175)


    Here is part of the role log:


    03:14:07.969 INFO org.apache.hadoop.ha.ZKFailoverController
    ZK Election indicated that NameNode at prod-node015.lol.ru/10.66.49.155:8020 should become standby
    03:14:07.974 INFO org.apache.hadoop.ha.ZKFailoverController
    Successfully transitioned NameNode at prod-node015.lol.ru/10.66.49.155:8020 to standby state
    03:15:10.000 INFO org.apache.zookeeper.ClientCnxn
    Unable to read additional data from server sessionid 0x63e41ac865a1d4d, likely server has closed socket, closing socket connection and attempting reconnect
    03:15:10.111 FATAL org.apache.hadoop.ha.ActiveStandbyElector
    Received create error from Zookeeper. code:CONNECTIONLOSS for path /hadoop-ha/nameservice1/ActiveStandbyElectorLock. Not retrying further znode create connection errors.
    03:15:10.888 INFO org.apache.zookeeper.ZooKeeper
    Session: 0x63e41ac865a1d4d closed
    03:15:10.889 FATAL org.apache.hadoop.ha.ZKFailoverController
    Fatal error occurred:Received create error from Zookeeper. code:CONNECTIONLOSS for path /hadoop-ha/nameservice1/ActiveStandbyElectorLock. Not retrying further znode create connection errors.
    03:15:10.889 WARN org.apache.hadoop.ha.ActiveStandbyElector
    Ignoring stale result from old client with sessionId 0x63e41ac865a1d4d
    03:15:10.889 INFO org.apache.hadoop.ipc.Server
    Stopping server on 8019
    03:15:10.889 WARN org.apache.hadoop.ha.ActiveStandbyElector
    Ignoring stale result from old client with sessionId 0x63e41ac865a1d4d
    03:15:10.890 WARN org.apache.hadoop.ha.ActiveStandbyElector
    Ignoring stale result from old client with sessionId 0x63e41ac865a1d4d
    03:15:10.890 WARN org.apache.hadoop.ha.ActiveStandbyElector
    Ignoring stale result from old client with sessionId 0x63e41ac865a1d4d
    03:15:10.890 INFO org.apache.zookeeper.ClientCnxn
    EventThread shut down
    03:15:10.890 INFO org.apache.hadoop.ipc.Server
    Stopping IPC Server Responder
    03:15:10.890 INFO org.apache.hadoop.ipc.Server
    Stopping IPC Server listener on 8019
    03:15:10.890 INFO org.apache.hadoop.ha.ActiveStandbyElector
    Yielding from election
    03:15:10.891 INFO org.apache.hadoop.ha.HealthMonitor
    Stopping HealthMonitor thread


    What am I doing wrong?



  • Serega Sheypak at May 14, 2013 at 9:53 am
    It happened again; one of the failover controllers is in a *bad* state.
    How can I get to the root cause of the failure?
    3:39:08.233 INFO org.apache.hadoop.ha.ZKFailoverController
    Successfully transitioned NameNode at prod-node033.lol.ru/10.66.49.193:8020 to active state
    03:43:17.993 INFO org.apache.zookeeper.ClientCnxn
    Unable to read additional data from server sessionid 0x83e9de45d190000, likely server has closed socket, closing socket connection and attempting reconnect
    03:43:18.103 FATAL org.apache.hadoop.ha.ActiveStandbyElector
    Received create error from Zookeeper. code:CONNECTIONLOSS for path /hadoop-ha/nameservice1/ActiveStandbyElectorLock. Not retrying further znode create connection errors.
    03:43:19.193 INFO org.apache.zookeeper.ZooKeeper
    Session: 0x83e9de45d190000 closed
    03:43:19.194 FATAL org.apache.hadoop.ha.ZKFailoverController
    Fatal error occurred:Received create error from Zookeeper. code:CONNECTIONLOSS for path /hadoop-ha/nameservice1/ActiveStandbyElectorLock. Not retrying further znode create connection errors.
    03:43:19.194 INFO org.apache.hadoop.ipc.Server
    Stopping server on 8019
    03:43:19.194 WARN org.apache.hadoop.ha.ActiveStandbyElector
    Ignoring stale result from old client with sessionId 0x83e9de45d190000
    03:43:19.194 INFO org.apache.hadoop.ipc.Server
    Stopping IPC Server listener on 8019
    03:43:19.194 INFO org.apache.hadoop.ipc.Server
    Stopping IPC Server Responder
    03:43:19.194 WARN org.apache.hadoop.ha.ActiveStandbyElector
    Ignoring stale result from old client with sessionId 0x83e9de45d190000
    03:43:19.194 INFO org.apache.hadoop.ha.ActiveStandbyElector
    Yielding from election
    03:43:19.195 INFO org.apache.hadoop.ha.HealthMonitor
    Stopping HealthMonitor thread
    03:43:19.195 WARN org.apache.hadoop.ha.ActiveStandbyElector
    Ignoring stale result from old client with sessionId 0x83e9de45d190000
    03:43:19.195 WARN org.apache.hadoop.ha.ActiveStandbyElector
    Ignoring stale result from old client with sessionId 0x83e9de45d190000
    03:43:19.195 WARN org.apache.hadoop.ha.ActiveStandbyElector
    Ignoring stale result from old client with sessionId 0x83e9de45d190000
    03:43:19.195 WARN org.apache.hadoop.ha.ActiveStandbyElector
    Ignoring stale result from old client with sessionId 0x83e9de45d190000
    03:43:19.195 WARN org.apache.hadoop.ha.ActiveStandbyElector
    Ignoring stale result from old client with sessionId 0x83e9de45d190000
    03:43:19.195 WARN org.apache.hadoop.ha.ActiveStandbyElector
    Ignoring stale result from old client with sessionId 0x83e9de45d190000
    03:43:19.196 WARN org.apache.hadoop.ha.ActiveStandbyElector
    Ignoring stale result from old client with sessionId 0x83e9de45d190000
    03:43:19.196 INFO org.apache.zookeeper.ClientCnxn
    EventThread shut down


    On Monday, May 6, 2013 at 20:51:38 UTC+4, bc Wong wrote:
    Moving to cdh-user.

    The ZK CONNECTIONLOSS is a sign of a connectivity problem between the ZK
    server and the client (the Failover Controller). Unfortunately, that could have
    a lot of different causes. What do the ZK logs say? Have you experienced
    flaky networking before? What's the load on your ZK servers?

    Cheers,
    bc
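
    A rough way to answer those questions from a shell (a sketch only; the
    hostnames, the log path /var/log/zookeeper/zookeeper.log, and the 03:43 time
    window from the log above are assumptions to adjust for your cluster):

    for h in zk1 zk2 zk3 zk4 zk5; do
      echo "== $h =="
      # Session expirations, closed connections, and slow-fsync warnings around
      # the time the failover controller lost its ZK session
      ssh "$h" "grep '03:43:' /var/log/zookeeper/zookeeper.log | grep -iE 'expir|closed|too many connections|fsync'"
      ssh "$h" uptime   # rough load check on the ZK host
    done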

  • Vinithra Varadharajan at May 15, 2013 at 7:33 pm
    Serega,
    It seems like there's a problem with the interaction between the
    FailoverController and ZK. Can you attach your ZK server logs?
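
    For reference, on a CM-managed CDH cluster the ZooKeeper server logs usually
    live under /var/log/zookeeper/ on each ZK host (an assumption to verify
    against your ZooKeeper service configuration); a rough sketch for bundling
    them:

    for h in zk1 zk2 zk3 zk4 zk5; do
      ssh "$h" 'tar czf - /var/log/zookeeper' > "zk-logs-$h.tar.gz"
    done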

    -Vinithra

  • Serega Sheypak at May 17, 2013 at 7:17 am
    It happened again.
    Here is a related problem:
    https://groups.google.com/a/cloudera.org/forum/?fromgroups#!topic/cdh-user/s7i5nj9vJBk
    *Here is the log from failover controller prod-node033:*
    03:19:08.241 INFO org.apache.hadoop.ha.ActiveStandbyElector
    Session connected.
    03:19:08.241 INFO org.apache.hadoop.ha.ActiveStandbyElector
    Session connected.
    03:19:08.241 INFO org.apache.hadoop.ha.ActiveStandbyElector
    Session connected.
    03:19:08.242 INFO org.apache.hadoop.ha.ActiveStandbyElector
    Checking for any old active which needs to be fenced...
    03:19:08.243 INFO org.apache.hadoop.ha.ActiveStandbyElector
    Old node exists: 0a0c6e616d657365727669636531120b6e616d656e6f64653137311a1b70726f642d6e6f64653033332e6b79632e6d656761666f6e2e727520d43e28d33e
    03:19:08.243 INFO org.apache.hadoop.ha.ActiveStandbyElector
    But old node has our own data, so don't need to fence it.
    03:19:08.243 INFO org.apache.hadoop.ha.ActiveStandbyElector
    Writing znode /hadoop-ha/nameservice1/ActiveBreadCrumb to indicate that the local node is the most recent active...
    03:19:08.246 INFO org.apache.hadoop.ha.ZKFailoverController
    Trying to make NameNode at prod-node033.ru/10.66.49.193:8020 active...
    03:19:08.252 INFO org.apache.hadoop.ha.ZKFailoverController
    Successfully transitioned NameNode at prod-node033.ru/10.66.49.193:8020 to active state
    03:27:09.993 INFO org.apache.zookeeper.ClientCnxn
    Unable to read additional data from server sessionid 0x133ea8b05d740000, likely server has closed socket, closing socket connection and attempting reconnect
    03:27:10.095 FATAL org.apache.hadoop.ha.ActiveStandbyElector
    Received create error from Zookeeper. code:CONNECTIONLOSS for path /hadoop-ha/nameservice1/ActiveStandbyElectorLock. Not retrying further znode create connection errors.
    03:27:10.511 INFO org.apache.zookeeper.ZooKeeper
    Session: 0x133ea8b05d740000 closed
    03:27:10.511 FATAL org.apache.hadoop.ha.ZKFailoverController
    Fatal error occurred:Received create error from Zookeeper. code:CONNECTIONLOSS for path /hadoop-ha/nameservice1/ActiveStandbyElectorLock. Not retrying further znode create connection errors.
    03:27:10.513 INFO org.apache.hadoop.ipc.Server
    Stopping server on 8019
    03:27:10.513 WARN org.apache.hadoop.ha.ActiveStandbyElector
    Ignoring stale result from old client with sessionId 0x133ea8b05d740000
    03:27:10.513 WARN org.apache.hadoop.ha.ActiveStandbyElector
    Ignoring stale result from old client with sessionId 0x133ea8b05d740000
    03:27:10.513 WARN org.apache.hadoop.ha.ActiveStandbyElector
    Ignoring stale result from old client with sessionId 0x133ea8b05d740000
    03:27:10.513 INFO org.apache.hadoop.ipc.Server
    Stopping IPC Server listener on 8019
    03:27:10.513 INFO org.apache.hadoop.ipc.Server
    Stopping IPC Server Responder
    03:27:10.513 WARN org.apache.hadoop.ha.ActiveStandbyElector
    Ignoring stale result from old client with sessionId 0x133ea8b05d740000
    03:27:10.514 WARN org.apache.hadoop.ha.ActiveStandbyElector
    Ignoring stale result from old client with sessionId 0x133ea8b05d740000
    03:27:10.514 WARN org.apache.hadoop.ha.ActiveStandbyElector
    Ignoring stale result from old client with sessionId 0x133ea8b05d740000
    03:27:10.514 WARN org.apache.hadoop.ha.ActiveStandbyElector
    Ignoring stale result from old client with sessionId 0x133ea8b05d740000
    03:27:10.514 WARN org.apache.hadoop.ha.ActiveStandbyElector
    Ignoring stale result from old client with sessionId 0x133ea8b05d740000
    03:27:10.514 INFO org.apache.hadoop.ha.ActiveStandbyElector
    Yielding from election
    03:27:10.514 INFO org.apache.hadoop.ha.HealthMonitor
    Stopping HealthMonitor thread
    03:27:10.515 WARN org.apache.hadoop.ha.ActiveStandbyElector
    Ignoring stale result from old client with sessionId 0x133ea8b05d740000
    03:27:10.515 WARN org.apache.hadoop.ha.ActiveStandbyElector
    Ignoring stale result from old client with sessionId 0x133ea8b05d740000
    03:27:10.515 WARN org.apache.hadoop.ha.ActiveStandbyElector
    Ignoring stale result from old client with sessionId 0x133ea8b05d740000
    03:27:10.515 WARN org.apache.hadoop.ha.ActiveStandbyElector
    Ignoring stale result from old client with sessionId 0x133ea8b05d740000
    03:27:10.515 WARN org.apache.hadoop.ha.ActiveStandbyElector
    Ignoring stale result from old client with sessionId 0x133ea8b05d740000
    03:27:10.515 WARN org.apache.hadoop.ha.ActiveStandbyElector
    Ignoring stale result from old client with sessionId 0x133ea8b05d740000
    03:27:10.515 WARN org.apache.hadoop.ha.ActiveStandbyElector
    Ignoring stale result from old client with sessionId 0x133ea8b05d740000
    03:27:10.515 WARN org.apache.hadoop.ha.ActiveStandbyElector
    Ignoring stale result from old client with sessionId 0x133ea8b05d740000
    03:27:10.515 WARN org.apache.hadoop.ha.ActiveStandbyElector
    Ignoring stale result from old client with sessionId 0x133ea8b05d740000
    03:27:10.516 WARN org.apache.hadoop.ha.ActiveStandbyElector
    Ignoring stale result from old client with sessionId 0x133ea8b05d740000
    03:27:10.516 WARN org.apache.hadoop.ha.ActiveStandbyElector
    Ignoring stale result from old client with sessionId 0x133ea8b05d740000
    03:27:10.516 WARN org.apache.hadoop.ha.ActiveStandbyElector
    Ignoring stale result from old client with sessionId 0x133ea8b05d740000
    03:27:10.516 WARN org.apache.hadoop.ha.ActiveStandbyElector
    Ignoring stale result from old client with sessionId 0x133ea8b05d740000
    03:27:10.516 WARN org.apache.hadoop.ha.ActiveStandbyElector
    Ignoring stale result from old client with sessionId 0x133ea8b05d740000




    Here is the log from failover controller prod-015:
    16:47:07.223 INFO org.apache.hadoop.ha.ActiveStandbyElector: Session connected.
    16:47:07.223 INFO org.apache.hadoop.ha.ActiveStandbyElector: Session connected.
    16:47:07.224 INFO org.apache.hadoop.ha.ZKFailoverController: ZK Election indicated that NameNode at prod-node015.ru/10.66.49.155:8020 should become standby
    16:47:07.229 INFO org.apache.hadoop.ha.ZKFailoverController: Successfully transitioned NameNode at prod-node015.ru/10.66.49.155:8020 to standby state
    03:27:07.109 INFO org.apache.zookeeper.ClientCnxn: Unable to read additional data from server sessionid 0x83ea8b063e20001, likely server has closed socket, closing socket connection and attempting reconnect
    03:27:07.210 FATAL org.apache.hadoop.ha.ActiveStandbyElector: Received create error from Zookeeper. code:CONNECTIONLOSS for path /hadoop-ha/nameservice1/ActiveStandbyElectorLock. Not retrying further znode create connection errors.
    03:27:08.147 INFO org.apache.zookeeper.ZooKeeper: Session: 0x83ea8b063e20001 closed
    03:27:08.148 FATAL org.apache.hadoop.ha.ZKFailoverController: Fatal error occurred: Received create error from Zookeeper. code:CONNECTIONLOSS for path /hadoop-ha/nameservice1/ActiveStandbyElectorLock. Not retrying further znode create connection errors.
    03:27:08.149 INFO org.apache.hadoop.ipc.Server: Stopping server on 8019
    03:27:08.149 WARN org.apache.hadoop.ha.ActiveStandbyElector: Ignoring stale result from old client with sessionId 0x83ea8b063e20001
    03:27:08.149 WARN org.apache.hadoop.ha.ActiveStandbyElector: Ignoring stale result from old client with sessionId 0x83ea8b063e20001
    03:27:08.149 WARN org.apache.hadoop.ha.ActiveStandbyElector: Ignoring stale result from old client with sessionId 0x83ea8b063e20001
    03:27:08.149 INFO org.apache.hadoop.ipc.Server: Stopping IPC Server Responder
    03:27:08.149 INFO org.apache.hadoop.ha.ActiveStandbyElector: Yielding from election
    03:27:08.150 INFO org.apache.hadoop.ha.HealthMonitor: Stopping HealthMonitor thread
    03:27:08.149 INFO org.apache.hadoop.ipc.Server: Stopping IPC Server listener on 8019
    03:27:08.150 WARN org.apache.hadoop.ha.ActiveStandbyElector: Ignoring stale result from old client with sessionId 0x83ea8b063e20001
    03:27:08.150 WARN org.apache.hadoop.ha.ActiveStandbyElector: Ignoring stale result from old client with sessionId 0x83ea8b063e20001
    03:27:08.150 WARN org.apache.hadoop.ha.ActiveStandbyElector: Ignoring stale result from old client with sessionId 0x83ea8b063e20001
    03:27:08.150 WARN org.apache.hadoop.ha.ActiveStandbyElector: Ignoring stale result from old client with sessionId 0x83ea8b063e20001
    03:27:08.150 WARN org.apache.hadoop.ha.ActiveStandbyElector: Ignoring stale result from old client with sessionId 0x83ea8b063e20001
    03:27:08.150 WARN org.apache.hadoop.ha.ActiveStandbyElector: Ignoring stale result from old client with sessionId 0x83ea8b063e20001
    03:27:08.151 WARN org.apache.hadoop.ha.ActiveStandbyElector: Ignoring stale result from old client with sessionId 0x83ea8b063e20001
    03:27:08.151 WARN org.apache.hadoop.ha.ActiveStandbyElector: Ignoring stale result from old client with sessionId 0x83ea8b063e20001
    03:27:08.151 WARN org.apache.hadoop.ha.ActiveStandbyElector: Ignoring stale result from old client with sessionId 0x83ea8b063e20001
    03:27:08.151 WARN org.apache.hadoop.ha.ActiveStandbyElector: Ignoring stale result from old client with sessionId 0x83ea8b063e20001
    03:27:08.151 INFO org.apache.zookeeper.ClientCnxn: EventThread shut down


    Here are logs from one of the Zookeeper servers. We have 5 servers in this
    service. All logs are the same:
    11:09:34.322 WARN org.apache.zookeeper.server.NIOServerCnxnFactory: Too many connections from /10.66.49.134 - max is 60
    11:09:34.446 WARN org.apache.zookeeper.server.NIOServerCnxnFactory: Too many connections from /10.66.49.134 - max is 60
    11:09:34.475 WARN org.apache.zookeeper.server.NIOServerCnxnFactory: Too many connections from /10.66.49.134 - max is 60
    11:09:34.594 WARN org.apache.zookeeper.server.NIOServerCnxnFactory: Too many connections from /10.66.49.134 - max is 60
    11:09:34.738 WARN org.apache.zookeeper.server.NIOServerCnxnFactory: Too many connections from /10.66.49.134 - max is 60
    11:09:34.834 WARN org.apache.zookeeper.server.NIOServerCnxnFactory: Too many connections from /10.66.49.134 - max is 60
    11:09:35.075 WARN org.apache.zookeeper.server.NIOServerCnxnFactory: Too many connections from /10.66.49.134 - max is 60
    11:09:35.297 WARN org.apache.zookeeper.server.NIOServerCnxnFactory: Too many connections from /10.66.49.134 - max is 60
    11:09:35.466 WARN org.apache.zookeeper.server.NIOServerCnxnFactory: Too many connections from /10.66.49.134 - max is 60
    11:09:35.499 WARN org.apache.zookeeper.server.NIOServerCnxnFactory: Too many connections from /10.66.49.134 - max is 60


    10.66.49.134 is the server where hiveserver2 is located. I have one more
    discussion where the problem is that hiveserver2 causes a DoS on the
    Zookeeper service.
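
    For reference, a quick way to confirm which client is exhausting the per-IP
    limit (the "max is 60" in the warnings is ZooKeeper's default maxClientCnxns)
    is to count established connections on a ZK server. A rough sketch, assuming
    the default client port 2181 and that netstat/nc are available on that host;
    10.66.49.134 is the hiveserver2 address from the logs above:

    # Count established client connections per source IP on a ZooKeeper server
    # (2181 is the default client port; adjust if yours differs).
    netstat -tan | awk '$4 ~ /:2181$/ && $6 == "ESTABLISHED" {split($5, a, ":"); print a[1]}' \
        | sort | uniq -c | sort -rn | head

    # Or ask ZooKeeper itself with the "cons" four-letter-word command and count
    # the lines coming from the suspected host:
    echo cons | nc localhost 2181 | grep -c '10.66.49.134'

    # If one client legitimately needs more than 60 connections, the per-IP limit
    # can be raised with maxClientCnxns in zoo.cfg (default 60) plus a ZK restart,
    # but a connection-leaking client (hiveserver2 here) should be fixed instead.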


    11:11:34.967 INFO org.apache.zookeeper.ClientCnxn: Opening socket connection to server prod-node031.ru/10.66.49.189:2181. Will not attempt to authenticate using SASL (Unable to locate a login configuration)
    11:11:34.968 INFO org.apache.zookeeper.ClientCnxn: Socket connection established to prod-node031.ru/10.66.49.189:2181, initiating session
    11:11:34.968 WARN org.apache.zookeeper.ClientCnxn: Session 0x0 for server prod-node031.ru/10.66.49.189:2181, unexpected error, closing socket connection and attempting reconnect
    java.io.IOException: Connection reset by peer
      at sun.nio.ch.FileDispatcher.read0(Native Method)
      at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:21)
      at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:198)
      at sun.nio.ch.IOUtil.read(IOUtil.java:166)
      at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:245)
      at org.apache.zookeeper.ClientCnxnSocketNIO.doIO(ClientCnxnSocketNIO.java:68)
      at org.apache.zookeeper.ClientCnxnSocketNIO.doTransport(ClientCnxnSocketNIO.java:355)
      at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1068)

    11:11:34.974 INFO org.apache.zookeeper.ClientCnxn: Opening socket connection to server prod-node040.ru/10.66.49.207:2181. Will not attempt to authenticate using SASL (Unable to locate a login configuration)
    11:11:34.975 INFO org.apache.zookeeper.ClientCnxn: Socket connection established to prod-node040.ru/10.66.49.207:2181, initiating session
    11:11:34.975 WARN org.apache.zookeeper.ClientCnxn: Session 0x0 for server prod-node040.ru/10.66.49.207:2181, unexpected error, closing socket connection and attempting reconnect
    java.io.IOException: Connection reset by peer
      at sun.nio.ch.FileDispatcher.read0(Native Method)
      at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:21)
      at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:198)
      at sun.nio.ch.IOUtil.read(IOUtil.java:166)
      at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:245)
      at org.apache.zookeeper.ClientCnxnSocketNIO.doIO(ClientCnxnSocketNIO.java:68)
      at org.apache.zookeeper.ClientCnxnSocketNIO.doTransport(ClientCnxnSocketNIO.java:355)
      at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1068)

    11:11:35.096 INFO org.apache.zookeeper.ClientCnxn: Opening socket connection to server prod-node031.ru/10.66.49.189:2181. Will not attempt to authenticate using SASL (Unable to locate a login configuration)
    11:11:35.096 INFO org.apache.zookeeper.ClientCnxn: Socket connection established to prod-node031.ru/10.66.49.189:2181, initiating session
    11:11:35.097 WARN org.apache.zookeeper.ClientCnxn: Session 0x0 for server prod-node031.ru/10.66.49.189:2181, unexpected error, closing socket connection and attempting reconnect
    java.io.IOException: Connection reset by peer
      at sun.nio.ch.FileDispatcher.read0(Native Method)
      at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:21)
      at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:198)
      at sun.nio.ch.IOUtil.read(IOUtil.java:166)
      at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:245)
      at org.apache.zookeeper.ClientCnxnSocketNIO.doIO(ClientCnxnSocketNIO.java:68)
      at org.apache.zookeeper.ClientCnxnSocketNIO.doTransport(ClientCnxnSocketNIO.java:355)
      at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1068)

    11:11:35.152
    2013/5/16 Serega Sheypak <serega.sheypak@gmail.com>
    I'll do it when it happens next time. Waiting for it..


    2013/5/15 Vinithra Varadharajan <vinithra@cloudera.com>
    Serega,
    It seems like there's a problem with the interaction between the
    FailoverController and ZK. Can you attach your ZK server logs?

    -Vinithra


    On Tue, May 14, 2013 at 2:53 AM, Serega Sheypak <serega.sheypak@gmail.com> wrote:
    It happened again; one of the failover controllers is in *bad* state.
    How can I get the root cause of the failure?
    3:39:08.233 INFO org.apache.hadoop.ha.ZKFailoverController: Successfully transitioned NameNode at prod-node033.lol.ru/10.66.49.193:8020 to active state
    03:43:17.993 INFO org.apache.zookeeper.ClientCnxn: Unable to read additional data from server sessionid 0x83e9de45d190000, likely server has closed socket, closing socket connection and attempting reconnect
    03:43:18.103 FATAL org.apache.hadoop.ha.ActiveStandbyElector: Received create error from Zookeeper. code:CONNECTIONLOSS for path /hadoop-ha/nameservice1/ActiveStandbyElectorLock. Not retrying further znode create connection errors.
    03:43:19.193 INFO org.apache.zookeeper.ZooKeeper: Session: 0x83e9de45d190000 closed
    03:43:19.194 FATAL org.apache.hadoop.ha.ZKFailoverController: Fatal error occurred: Received create error from Zookeeper. code:CONNECTIONLOSS for path /hadoop-ha/nameservice1/ActiveStandbyElectorLock. Not retrying further znode create connection errors.
    03:43:19.194 INFO org.apache.hadoop.ipc.Server: Stopping server on 8019
    03:43:19.194 WARN org.apache.hadoop.ha.ActiveStandbyElector: Ignoring stale result from old client with sessionId 0x83e9de45d190000
    03:43:19.194 INFO org.apache.hadoop.ipc.Server: Stopping IPC Server listener on 8019
    03:43:19.194 INFO org.apache.hadoop.ipc.Server: Stopping IPC Server Responder
    03:43:19.194 WARN org.apache.hadoop.ha.ActiveStandbyElector: Ignoring stale result from old client with sessionId 0x83e9de45d190000
    03:43:19.194 INFO org.apache.hadoop.ha.ActiveStandbyElector: Yielding from election
    03:43:19.195 INFO org.apache.hadoop.ha.HealthMonitor: Stopping HealthMonitor thread
    03:43:19.195 WARN org.apache.hadoop.ha.ActiveStandbyElector: Ignoring stale result from old client with sessionId 0x83e9de45d190000
    03:43:19.195 WARN org.apache.hadoop.ha.ActiveStandbyElector: Ignoring stale result from old client with sessionId 0x83e9de45d190000
    03:43:19.195 WARN org.apache.hadoop.ha.ActiveStandbyElector: Ignoring stale result from old client with sessionId 0x83e9de45d190000
    03:43:19.195 WARN org.apache.hadoop.ha.ActiveStandbyElector: Ignoring stale result from old client with sessionId 0x83e9de45d190000
    03:43:19.195 WARN org.apache.hadoop.ha.ActiveStandbyElector: Ignoring stale result from old client with sessionId 0x83e9de45d190000
    03:43:19.195 WARN org.apache.hadoop.ha.ActiveStandbyElector: Ignoring stale result from old client with sessionId 0x83e9de45d190000
    03:43:19.196 WARN org.apache.hadoop.ha.ActiveStandbyElector: Ignoring stale result from old client with sessionId 0x83e9de45d190000
    03:43:19.196 INFO org.apache.zookeeper.ClientCnxn: EventThread shut down


    On Monday, May 6, 2013 at 20:51:38 UTC+4, bc Wong wrote:
    Moving to cdh-user.

    The ZK CONNECTIONLOSS is a sign of a connectivity problem between the ZK
    server and the client (Failover Controller). Unfortunately, that could have
    a lot of different causes. What do the ZK logs say? Have you experienced
    flaky networking before? What's the load on your ZK servers?
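
    For what it's worth, a minimal way to answer those questions from the shell is
    ZooKeeper's built-in four-letter-word commands; a rough sketch, using two of the
    ZK hosts that appear in the logs above and the default client port 2181:

    # Ask each ZK server whether it is healthy and how loaded it is.
    for zk in prod-node031.ru prod-node040.ru; do
        echo "--- $zk ---"
        echo ruok | nc "$zk" 2181   # a healthy server answers "imok"
        echo stat | nc "$zk" 2181   # mode, latency, outstanding requests, client count
    done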

    Cheers,
    bc
    On Mon, May 6, 2013 at 12:28 AM, Serega Sheypak wrote:

    I've reduced the number of Zookeepers to 5 instances.
    One of my NameNodes is in a failed state again.
    Here is a part of the stdErr log:

    + exec /opt/cloudera/parcels/CDH-4.2.0-1.cdh4.2.0.p0.10/lib/hadoop-hdfs/bin/hdfs --config /var/run/cloudera-scm-agent/process/1397-hdfs-FAILOVERCONTROLLER zkfc
    Exception in thread "main" java.lang.RuntimeException: ZK Failover Controller failed: Received create error from Zookeeper. code:CONNECTIONLOSS for path /hadoop-ha/nameservice1/ActiveStandbyElectorLock. Not retrying further znode create connection errors.
    at org.apache.hadoop.ha.ZKFailoverController.mainLoop(ZKFailoverController.java:359)
    at org.apache.hadoop.ha.ZKFailoverController.doRun(ZKFailoverController.java:231)
    at org.apache.hadoop.ha.ZKFailoverController.access$000(ZKFailoverController.java:58)
    at org.apache.hadoop.ha.ZKFailoverController$1.run(ZKFailoverController.java:165)
    at org.apache.hadoop.ha.ZKFailoverController$1.run(ZKFailoverController.java:161)
    at org.apache.hadoop.security.SecurityUtil.doAsLoginUserOrFatal(SecurityUtil.java:452)
    at org.apache.hadoop.ha.ZKFailoverController.run(ZKFailoverController.java:161)
    at org.apache.hadoop.hdfs.tools.DFSZKFailoverController.main(DFSZKFailoverController.java:175)


    Here is a part of the role log details:


    03:14:07.969 INFO org.apache.hadoop.ha.ZKFailoverController: ZK Election indicated that NameNode at prod-node015.lol.ru/10.66.49.155:8020 should become standby
    03:14:07.974 INFO org.apache.hadoop.ha.ZKFailoverController: Successfully transitioned NameNode at prod-node015.lol.ru/10.66.49.155:8020 to standby state
    03:15:10.000 INFO org.apache.zookeeper.ClientCnxn: Unable to read additional data from server sessionid 0x63e41ac865a1d4d, likely server has closed socket, closing socket connection and attempting reconnect
    03:15:10.111 FATAL org.apache.hadoop.ha.ActiveStandbyElector: Received create error from Zookeeper. code:CONNECTIONLOSS for path /hadoop-ha/nameservice1/ActiveStandbyElectorLock. Not retrying further znode create connection errors.
    03:15:10.888 INFO org.apache.zookeeper.ZooKeeper: Session: 0x63e41ac865a1d4d closed
    03:15:10.889 FATAL org.apache.hadoop.ha.ZKFailoverController: Fatal error occurred: Received create error from Zookeeper. code:CONNECTIONLOSS for path /hadoop-ha/nameservice1/ActiveStandbyElectorLock. Not retrying further znode create connection errors.
    03:15:10.889 WARN org.apache.hadoop.ha.ActiveStandbyElector: Ignoring stale result from old client with sessionId 0x63e41ac865a1d4d
    03:15:10.889 INFO org.apache.hadoop.ipc.Server: Stopping server on 8019
    03:15:10.889 WARN org.apache.hadoop.ha.ActiveStandbyElector: Ignoring stale result from old client with sessionId 0x63e41ac865a1d4d
    03:15:10.890 WARN org.apache.hadoop.ha.ActiveStandbyElector: Ignoring stale result from old client with sessionId 0x63e41ac865a1d4d
    03:15:10.890 WARN org.apache.hadoop.ha.ActiveStandbyElector: Ignoring stale result from old client with sessionId 0x63e41ac865a1d4d
    03:15:10.890 INFO org.apache.zookeeper.ClientCnxn: EventThread shut down
    03:15:10.890 INFO org.apache.hadoop.ipc.Server: Stopping IPC Server Responder
    03:15:10.890 INFO org.apache.hadoop.ipc.Server: Stopping IPC Server listener on 8019
    03:15:10.890 INFO org.apache.hadoop.ha.ActiveStandbyElector: Yielding from election
    03:15:10.891 INFO org.apache.hadoop.ha.HealthMonitor: Stopping HealthMonitor thread


    What am I doing wrong?
  • Vinithra Varadharajan at May 17, 2013 at 8:50 pm
    10.66.49.134 is a server where hiveserver2 is located. I have one more
    discussion where the problem is that hiveserver2 causes DoS on Zookeeper
    service.
    Given that the FC talks to the ZK service regularly, the FC will be unhappy
    whenever the ZK service is unavailable and it cannot do so. Based on the
    other discussion, it looks like you're working toward fixing the HS2-ZK
    situation. Let us know if the FC continues to be unhappy after the ZK
    service is available.
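
    In case it helps, a minimal sketch of how to verify the HA state from the
    command line once ZK is reachable again; "nn1" and "nn2" below are placeholder
    NameNode IDs, use the values of dfs.ha.namenodes.nameservice1 from your
    hdfs-site.xml:

    # Check which NameNode is active and which is standby for nameservice1.
    hdfs haadmin -getServiceState nn1
    hdfs haadmin -getServiceState nn2

    # A ZKFC that died with "Not retrying further znode create connection errors"
    # does not recover on its own; restart the Failover Controller role (for
    # example from Cloudera Manager) after the ZooKeeper service is healthy again.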


    On Fri, May 17, 2013 at 12:17 AM, Serega Sheypak wrote:
    It happened again.
    Here is a related problem:
    https://groups.google.com/a/cloudera.org/forum/?fromgroups#!topic/cdh-user/s7i5nj9vJBk
  • Serega Sheypak at May 6, 2013 at 7:25 am
    Hi, I've reduced the number of Zookeepers to 5 nodes.
    The problem has returned. One of my NameNodes in the HA pair is in a failed
    state again.
    Here is the stdErr of the failed node from Cloudera Manager:

    Thu Apr 25 18:53:44 MSK 2013
    + locate_java_home
    + '[' -z '' ']'
    + for candidate in /usr/lib/j2sdk1.6-sun /usr/lib/jvm/java-6-sun '/usr/lib/jvm/java-1.6.0-sun-1.6.0.*' '/usr/lib/jvm/java-1.6.0-sun-1.6.0.*/jre/' /usr/lib/jvm/j2sdk1.6-oracle /usr/lib/jvm/j2sdk1.6-oracle/jre '/usr/java/jdk1.6*' '/usr/java/jre1.6*' /Library/Java/Home /usr/java/default /usr/lib/jvm/default-java /usr/lib/jvm/java-openjdk /usr/lib/jvm/jre-openjdk '/usr/lib/jvm/java-1.6.0-openjdk-1.6.*' '/usr/lib/jvm/jre-1.6.0-openjdk*'
    + '[' -e /usr/lib/j2sdk1.6-sun/bin/java ']'
    + for candidate in /usr/lib/j2sdk1.6-sun /usr/lib/jvm/java-6-sun '/usr/lib/jvm/java-1.6.0-sun-1.6.0.*' '/usr/lib/jvm/java-1.6.0-sun-1.6.0.*/jre/' /usr/lib/jvm/j2sdk1.6-oracle /usr/lib/jvm/j2sdk1.6-oracle/jre '/usr/java/jdk1.6*' '/usr/java/jre1.6*' /Library/Java/Home /usr/java/default /usr/lib/jvm/default-java /usr/lib/jvm/java-openjdk /usr/lib/jvm/jre-openjdk '/usr/lib/jvm/java-1.6.0-openjdk-1.6.*' '/usr/lib/jvm/jre-1.6.0-openjdk*'
    + '[' -e /usr/lib/jvm/java-6-sun/bin/java ']'
    + for candidate in /usr/lib/j2sdk1.6-sun /usr/lib/jvm/java-6-sun '/usr/lib/jvm/java-1.6.0-sun-1.6.0.*' '/usr/lib/jvm/java-1.6.0-sun-1.6.0.*/jre/' /usr/lib/jvm/j2sdk1.6-oracle /usr/lib/jvm/j2sdk1.6-oracle/jre '/usr/java/jdk1.6*' '/usr/java/jre1.6*' /Library/Java/Home /usr/java/default /usr/lib/jvm/default-java /usr/lib/jvm/java-openjdk /usr/lib/jvm/jre-openjdk '/usr/lib/jvm/java-1.6.0-openjdk-1.6.*' '/usr/lib/jvm/jre-1.6.0-openjdk*'
    + '[' -e '/usr/lib/jvm/java-1.6.0-sun-1.6.0.*/bin/java' ']'
    + for candidate in /usr/lib/j2sdk1.6-sun /usr/lib/jvm/java-6-sun '/usr/lib/jvm/java-1.6.0-sun-1.6.0.*' '/usr/lib/jvm/java-1.6.0-sun-1.6.0.*/jre/' /usr/lib/jvm/j2sdk1.6-oracle /usr/lib/jvm/j2sdk1.6-oracle/jre '/usr/java/jdk1.6*' '/usr/java/jre1.6*' /Library/Java/Home /usr/java/default /usr/lib/jvm/default-java /usr/lib/jvm/java-openjdk /usr/lib/jvm/jre-openjdk '/usr/lib/jvm/java-1.6.0-openjdk-1.6.*' '/usr/lib/jvm/jre-1.6.0-openjdk*'
    + '[' -e '/usr/lib/jvm/java-1.6.0-sun-1.6.0.*/jre//bin/java' ']'
    + for candidate in /usr/lib/j2sdk1.6-sun /usr/lib/jvm/java-6-sun '/usr/lib/jvm/java-1.6.0-sun-1.6.0.*' '/usr/lib/jvm/java-1.6.0-sun-1.6.0.*/jre/' /usr/lib/jvm/j2sdk1.6-oracle /usr/lib/jvm/j2sdk1.6-oracle/jre '/usr/java/jdk1.6*' '/usr/java/jre1.6*' /Library/Java/Home /usr/java/default /usr/lib/jvm/default-java /usr/lib/jvm/java-openjdk /usr/lib/jvm/jre-openjdk '/usr/lib/jvm/java-1.6.0-openjdk-1.6.*' '/usr/lib/jvm/jre-1.6.0-openjdk*'
    + '[' -e /usr/lib/jvm/j2sdk1.6-oracle/bin/java ']'
    + for candidate in /usr/lib/j2sdk1.6-sun /usr/lib/jvm/java-6-sun '/usr/lib/jvm/java-1.6.0-sun-1.6.0.*' '/usr/lib/jvm/java-1.6.0-sun-1.6.0.*/jre/' /usr/lib/jvm/j2sdk1.6-oracle /usr/lib/jvm/j2sdk1.6-oracle/jre '/usr/java/jdk1.6*' '/usr/java/jre1.6*' /Library/Java/Home /usr/java/default /usr/lib/jvm/default-java /usr/lib/jvm/java-openjdk /usr/lib/jvm/jre-openjdk '/usr/lib/jvm/java-1.6.0-openjdk-1.6.*' '/usr/lib/jvm/jre-1.6.0-openjdk*'
    + '[' -e /usr/lib/jvm/j2sdk1.6-oracle/jre/bin/java ']'
    + for candidate in /usr/lib/j2sdk1.6-sun /usr/lib/jvm/java-6-sun '/usr/lib/jvm/java-1.6.0-sun-1.6.0.*' '/usr/lib/jvm/java-1.6.0-sun-1.6.0.*/jre/' /usr/lib/jvm/j2sdk1.6-oracle /usr/lib/jvm/j2sdk1.6-oracle/jre '/usr/java/jdk1.6*' '/usr/java/jre1.6*' /Library/Java/Home /usr/java/default /usr/lib/jvm/default-java /usr/lib/jvm/java-openjdk /usr/lib/jvm/jre-openjdk '/usr/lib/jvm/java-1.6.0-openjdk-1.6.*' '/usr/lib/jvm/jre-1.6.0-openjdk*'
    + '[' -e /usr/java/jdk1.6.0_37/bin/java ']'
    + export JAVA_HOME=/usr/java/jdk1.6.0_37
    + JAVA_HOME=/usr/java/jdk1.6.0_37
    + break
    + '[' -z /usr/java/jdk1.6.0_37 ']'
    + source_parcel_environment
    + '[' '!' -z /opt/cloudera/parcels/CDH-4.2.0-1.cdh4.2.0.p0.10/meta/cdh_env.sh ']'
    + OLD_IFS='
    '
    + IFS=:
    + SCRIPT_ARRAY=($SCM_DEFINES_SCRIPTS)
    + IFS='
    '
    + for SCRIPT in '${SCRIPT_ARRAY[@]}'
    + . /opt/cloudera/parcels/CDH-4.2.0-1.cdh4.2.0.p0.10/meta/cdh_env.sh
    ++ export CDH_HADOOP_HOME=/opt/cloudera/parcels/CDH-4.2.0-1.cdh4.2.0.p0.10/lib/hadoop
    ++ CDH_HADOOP_HOME=/opt/cloudera/parcels/CDH-4.2.0-1.cdh4.2.0.p0.10/lib/hadoop
    ++ export CDH_MR1_HOME=/opt/cloudera/parcels/CDH-4.2.0-1.cdh4.2.0.p0.10/lib/hadoop-0.20-mapreduce
    ++ CDH_MR1_HOME=/opt/cloudera/parcels/CDH-4.2.0-1.cdh4.2.0.p0.10/lib/hadoop-0.20-mapreduce
    ++ export CDH_HDFS_HOME=/opt/cloudera/parcels/CDH-4.2.0-1.cdh4.2.0.p0.10/lib/hadoop-hdfs
    ++ CDH_HDFS_HOME=/opt/cloudera/parcels/CDH-4.2.0-1.cdh4.2.0.p0.10/lib/hadoop-hdfs
    ++ export CDH_HTTPFS_HOME=/opt/cloudera/parcels/CDH-4.2.0-1.cdh4.2.0.p0.10/lib/hadoop-httpfs
    ++ CDH_HTTPFS_HOME=/opt/cloudera/parcels/CDH-4.2.0-1.cdh4.2.0.p0.10/lib/hadoop-httpfs
    ++ export CDH_MR2_HOME=/opt/cloudera/parcels/CDH-4.2.0-1.cdh4.2.0.p0.10/lib/hadoop-mapreduce
    ++ CDH_MR2_HOME=/opt/cloudera/parcels/CDH-4.2.0-1.cdh4.2.0.p0.10/lib/hadoop-mapreduce
    ++ export CDH_YARN_HOME=/opt/cloudera/parcels/CDH-4.2.0-1.cdh4.2.0.p0.10/lib/hadoop-yarn
    ++ CDH_YARN_HOME=/opt/cloudera/parcels/CDH-4.2.0-1.cdh4.2.0.p0.10/lib/hadoop-yarn
    ++ export CDH_HBASE_HOME=/opt/cloudera/parcels/CDH-4.2.0-1.cdh4.2.0.p0.10/lib/hbase
    ++ CDH_HBASE_HOME=/opt/cloudera/parcels/CDH-4.2.0-1.cdh4.2.0.p0.10/lib/hbase
    ++ export CDH_ZOOKEEPER_HOME=/opt/cloudera/parcels/CDH-4.2.0-1.cdh4.2.0.p0.10/lib/zookeeper
    ++ CDH_ZOOKEEPER_HOME=/opt/cloudera/parcels/CDH-4.2.0-1.cdh4.2.0.p0.10/lib/zookeeper
    ++ export CDH_HIVE_HOME=/opt/cloudera/parcels/CDH-4.2.0-1.cdh4.2.0.p0.10/lib/hive
    ++ CDH_HIVE_HOME=/opt/cloudera/parcels/CDH-4.2.0-1.cdh4.2.0.p0.10/lib/hive
    ++ export CDH_HUE_HOME=/opt/cloudera/parcels/CDH-4.2.0-1.cdh4.2.0.p0.10/share/hue
    ++ CDH_HUE_HOME=/opt/cloudera/parcels/CDH-4.2.0-1.cdh4.2.0.p0.10/share/hue
    ++ export CDH_OOZIE_HOME=/opt/cloudera/parcels/CDH-4.2.0-1.cdh4.2.0.p0.10/lib/oozie
    ++ CDH_OOZIE_HOME=/opt/cloudera/parcels/CDH-4.2.0-1.cdh4.2.0.p0.10/lib/oozie
    ++ export CDH_HUE_PLUGINS_HOME=/opt/cloudera/parcels/CDH-4.2.0-1.cdh4.2.0.p0.10/lib/hadoop
    ++ CDH_HUE_PLUGINS_HOME=/opt/cloudera/parcels/CDH-4.2.0-1.cdh4.2.0.p0.10/lib/hadoop
    ++ export CDH_FLUME_HOME=/opt/cloudera/parcels/CDH-4.2.0-1.cdh4.2.0.p0.10/lib/flume-ng
    ++ CDH_FLUME_HOME=/opt/cloudera/parcels/CDH-4.2.0-1.cdh4.2.0.p0.10/lib/flume-ng
    ++ export CDH_PIG_HOME=/opt/cloudera/parcels/CDH-4.2.0-1.cdh4.2.0.p0.10/lib/pig
    ++ CDH_PIG_HOME=/opt/cloudera/parcels/CDH-4.2.0-1.cdh4.2.0.p0.10/lib/pig
    ++ export TOMCAT_HOME=/opt/cloudera/parcels/CDH-4.2.0-1.cdh4.2.0.p0.10/lib/bigtop-tomcat
    ++ TOMCAT_HOME=/opt/cloudera/parcels/CDH-4.2.0-1.cdh4.2.0.p0.10/lib/bigtop-tomcat
    ++ export JSVC_HOME=/opt/cloudera/parcels/CDH-4.2.0-1.cdh4.2.0.p0.10/libexec/bigtop-utils
    ++ JSVC_HOME=/opt/cloudera/parcels/CDH-4.2.0-1.cdh4.2.0.p0.10/libexec/bigtop-utils
    ++ export CDH_HADOOP_BIN=/opt/cloudera/parcels/CDH-4.2.0-1.cdh4.2.0.p0.10/lib/hadoop/bin/hadoop
    ++ CDH_HADOOP_BIN=/opt/cloudera/parcels/CDH-4.2.0-1.cdh4.2.0.p0.10/lib/hadoop/bin/hadoop
    ++ export HIVE_DEFAULT_XML=/opt/cloudera/parcels/CDH-4.2.0-1.cdh4.2.0.p0.10/lib/hive/conf/hive-default.xml
    ++ HIVE_DEFAULT_XML=/opt/cloudera/parcels/CDH-4.2.0-1.cdh4.2.0.p0.10/lib/hive/conf/hive-default.xml
    + '[' 4 = 4 ']'
    + . /usr/lib64/cmf/service/common/cdh4-default
    ++ export HADOOP_HOME_WARN_SUPPRESS=true
    ++ HADOOP_HOME_WARN_SUPPRESS=true
    ++ export HADOOP_PREFIX=/opt/cloudera/parcels/CDH-4.2.0-1.cdh4.2.0.p0.10/lib/hadoop
    ++ HADOOP_PREFIX=/opt/cloudera/parcels/CDH-4.2.0-1.cdh4.2.0.p0.10/lib/hadoop
    ++ export HADOOP_LIBEXEC_DIR=/opt/cloudera/parcels/CDH-4.2.0-1.cdh4.2.0.p0.10/lib/hadoop/libexec
    ++ HADOOP_LIBEXEC_DIR=/opt/cloudera/parcels/CDH-4.2.0-1.cdh4.2.0.p0.10/lib/hadoop/libexec
    ++ export HADOOP_CONF_DIR=/var/run/cloudera-scm-agent/process/1397-hdfs-FAILOVERCONTROLLER
    ++ HADOOP_CONF_DIR=/var/run/cloudera-scm-agent/process/1397-hdfs-FAILOVERCONTROLLER
    ++ export HADOOP_COMMON_HOME=/opt/cloudera/parcels/CDH-4.2.0-1.cdh4.2.0.p0.10/lib/hadoop
    ++ HADOOP_COMMON_HOME=/opt/cloudera/parcels/CDH-4.2.0-1.cdh4.2.0.p0.10/lib/hadoop
    ++ export HADOOP_HDFS_HOME=/opt/cloudera/parcels/CDH-4.2.0-1.cdh4.2.0.p0.10/lib/hadoop-hdfs
    ++ HADOOP_HDFS_HOME=/opt/cloudera/parcels/CDH-4.2.0-1.cdh4.2.0.p0.10/lib/hadoop-hdfs
    ++ export HADOOP_MAPRED_HOME=/opt/cloudera/parcels/CDH-4.2.0-1.cdh4.2.0.p0.10/lib/hadoop-mapreduce
    ++ HADOOP_MAPRED_HOME=/opt/cloudera/parcels/CDH-4.2.0-1.cdh4.2.0.p0.10/lib/hadoop-mapreduce
    ++ export YARN_HOME=/opt/cloudera/parcels/CDH-4.2.0-1.cdh4.2.0.p0.10/lib/hadoop-yarn
    ++ YARN_HOME=/opt/cloudera/parcels/CDH-4.2.0-1.cdh4.2.0.p0.10/lib/hadoop-yarn
    + HDFS_BIN=/opt/cloudera/parcels/CDH-4.2.0-1.cdh4.2.0.p0.10/lib/hadoop-hdfs/bin/hdfs
    + export 'HADOOP_OPTS=-Djava.net.preferIPv4Stack=true '
    + HADOOP_OPTS='-Djava.net.preferIPv4Stack=true '
    + echo 'using /usr/java/jdk1.6.0_37 as JAVA_HOME'
    + echo 'using 4 as CDH_VERSION'
    + echo 'using /var/run/cloudera-scm-agent/process/1397-hdfs-FAILOVERCONTROLLER as CONF_DIR'
    + echo 'using as SECURE_USER'
    + echo 'using as SECURE_GROUP'
    + set_hadoop_classpath
    + set_classpath_in_var HADOOP_CLASSPATH
    + '[' -z HADOOP_CLASSPATH ']'
    + '[' /usr/share/cmf ']'
    ++ find /usr/share/cmf/lib/plugins -name '*.jar'
    ++ tr '\n' :
    + ADD_TO_CP=/usr/share/cmf/lib/plugins/navigator-plugin-4.5.0-shaded.jar:/usr/share/cmf/lib/plugins/event-publish-4.5.0-shaded.jar:/usr/share/cmf/lib/plugins/tt-instrumentation-4.5.0.jar:
    + ADD_TO_CP=/usr/share/cmf/lib/plugins/navigator-plugin-4.5.0-shaded.jar:/usr/share/cmf/lib/plugins/event-publish-4.5.0-shaded.jar:/usr/share/cmf/lib/plugins/tt-instrumentation-4.5.0.jar
    + eval 'OLD_VALUE=$HADOOP_CLASSPATH'
    ++ OLD_VALUE=
    + '[' -z ']'
    + export HADOOP_CLASSPATH=/usr/share/cmf/lib/plugins/navigator-plugin-4.5.0-shaded.jar:/usr/share/cmf/lib/plugins/event-publish-4.5.0-shaded.jar:/usr/share/cmf/lib/plugins/tt-instrumentation-4.5.0.jar
    + HADOOP_CLASSPATH=/usr/share/cmf/lib/plugins/navigator-plugin-4.5.0-shaded.jar:/usr/share/cmf/lib/plugins/event-publish-4.5.0-shaded.jar:/usr/share/cmf/lib/plugins/tt-instrumentation-4.5.0.jar
    + set -x
    + perl -pi -e 's#{{CMF_CONF_DIR}}#/var/run/cloudera-scm-agent/process/1397-hdfs-FAILOVERCONTROLLER#g' /var/run/cloudera-scm-agent/process/1397-hdfs-FAILOVERCONTROLLER/core-site.xml /var/run/cloudera-scm-agent/process/1397-hdfs-FAILOVERCONTROLLER/hdfs-site.xml
    + '[' -e /var/run/cloudera-scm-agent/process/1397-hdfs-FAILOVERCONTROLLER/topology.py ']'
    + '[' -e /var/run/cloudera-scm-agent/process/1397-hdfs-FAILOVERCONTROLLER/log4j.properties ']'
    + perl -pi -e 's#{{CMF_CONF_DIR}}#/var/run/cloudera-scm-agent/process/1397-hdfs-FAILOVERCONTROLLER#g' /var/run/cloudera-scm-agent/process/1397-hdfs-FAILOVERCONTROLLER/log4j.properties
    ++ find /var/run/cloudera-scm-agent/process/1397-hdfs-FAILOVERCONTROLLER -maxdepth 1 -name '*.py'
    + OUTPUT=/var/run/cloudera-scm-agent/process/1397-hdfs-FAILOVERCONTROLLER/cloudera_manager_agent_fencer.py
    + '[' /var/run/cloudera-scm-agent/process/1397-hdfs-FAILOVERCONTROLLER/cloudera_manager_agent_fencer.py '!=' '' ']'
    + chmod +x /var/run/cloudera-scm-agent/process/1397-hdfs-FAILOVERCONTROLLER/cloudera_manager_agent_fencer.py
    + export HADOOP_IDENT_STRING=hdfs
    + HADOOP_IDENT_STRING=hdfs
    + '[' -n '' ']'
    + acquire_kerberos_tgt hdfs.keytab
    + '[' -z hdfs.keytab ']'
    + '[' -n '' ']'
    + '[' validate-writable-empty-dirs = zkfc ']'
    + '[' file-operation = zkfc ']'
    + '[' bootstrap = zkfc ']'
    + '[' failover = zkfc ']'
    + '[' transition-to-active = zkfc ']'
    + '[' initializeSharedEdits = zkfc ']'
    + '[' initialize-znode = zkfc ']'
    + '[' format-namenode = zkfc ']'
    + '[' monitor-decommission = zkfc ']'
    + '[' jnSyncWait = zkfc ']'
    + '[' nnRpcWait = zkfc ']'
    + '[' monitor-upgrade = zkfc ']'
    + '[' finalize-upgrade = zkfc ']'
    + '[' mkdir = zkfc ']'
    + '[' namenode = zkfc -o secondarynamenode = zkfc -o datanode = zkfc ']'
    + exec /opt/cloudera/parcels/CDH-4.2.0-1.cdh4.2.0.p0.10/lib/hadoop-hdfs/bin/hdfs --config /var/run/cloudera-scm-agent/process/1397-hdfs-FAILOVERCONTROLLER zkfc
    Exception in thread "main" java.lang.RuntimeException: ZK Failover Controller failed: Received create error from Zookeeper. code:CONNECTIONLOSS for path /hadoop-ha/nameservice1/ActiveStandbyElectorLock. Not retrying further znode create connection errors.
      at org.apache.hadoop.ha.ZKFailoverController.mainLoop(ZKFailoverController.java:359)
      at org.apache.hadoop.ha.ZKFailoverController.doRun(ZKFailoverController.java:231)
      at org.apache.hadoop.ha.ZKFailoverController.access$000(ZKFailoverController.java:58)
      at org.apache.hadoop.ha.ZKFailoverController$1.run(ZKFailoverController.java:165)
      at org.apache.hadoop.ha.ZKFailoverController$1.run(ZKFailoverController.java:161)
      at org.apache.hadoop.security.SecurityUtil.doAsLoginUserOrFatal(SecurityUtil.java:452)
      at org.apache.hadoop.ha.ZKFailoverController.run(ZKFailoverController.java:161)
      at org.apache.hadoop.hdfs.tools.DFSZKFailoverController.main(DFSZKFailoverController.java:175)


    Here is a part of the role log:

    03:14:07.951 INFO org.apache.zookeeper.ClientCnxn: Session establishment complete on server prod-node040.kyc.megafon.ru/10.66.49.207:2181, sessionid = 0x63e41ac865a1d4d, negotiated timeout = 5000
    03:14:07.953 INFO org.apache.hadoop.ha.ActiveStandbyElector: Session connected.
    03:14:07.959 INFO org.apache.hadoop.ha.ActiveStandbyElector: Session connected.
    03:14:07.961 INFO org.apache.hadoop.ha.ActiveStandbyElector: Session connected.
    03:14:07.964 INFO org.apache.hadoop.ha.ActiveStandbyElector: Session connected.
    03:14:07.969 INFO org.apache.hadoop.ha.ZKFailoverController: ZK Election indicated that NameNode at prod-node015.kyc.megafon.ru/10.66.49.155:8020 should become standby
    03:14:07.974 INFO org.apache.hadoop.ha.ZKFailoverController: Successfully transitioned NameNode at prod-node015.kyc.megafon.ru/10.66.49.155:8020 to standby state
    03:15:10.000 INFO org.apache.zookeeper.ClientCnxn: Unable to read additional data from server sessionid 0x63e41ac865a1d4d, likely server has closed socket, closing socket connection and attempting reconnect
    03:15:10.111 FATAL org.apache.hadoop.ha.ActiveStandbyElector: Received create error from Zookeeper. code:CONNECTIONLOSS for path /hadoop-ha/nameservice1/ActiveStandbyElectorLock. Not retrying further znode create connection errors.
    03:15:10.888 INFO org.apache.zookeeper.ZooKeeper: Session: 0x63e41ac865a1d4d closed
    03:15:10.889 FATAL org.apache.hadoop.ha.ZKFailoverController: Fatal error occurred: Received create error from Zookeeper. code:CONNECTIONLOSS for path /hadoop-ha/nameservice1/ActiveStandbyElectorLock. Not retrying further znode create connection errors.
    03:15:10.889 WARN org.apache.hadoop.ha.ActiveStandbyElector: Ignoring stale result from old client with sessionId 0x63e41ac865a1d4d
    03:15:10.889 INFO org.apache.hadoop.ipc.Server: Stopping server on 8019
    03:15:10.889 WARN org.apache.hadoop.ha.ActiveStandbyElector: Ignoring stale result from old client with sessionId 0x63e41ac865a1d4d
    03:15:10.890 WARN org.apache.hadoop.ha.ActiveStandbyElector: Ignoring stale result from old client with sessionId 0x63e41ac865a1d4d
    03:15:10.890 WARN org.apache.hadoop.ha.ActiveStandbyElector: Ignoring stale result from old client with sessionId 0x63e41ac865a1d4d
    03:15:10.890 INFO org.apache.zookeeper.ClientCnxn: EventThread shut down
    03:15:10.890 INFO org.apache.hadoop.ipc.Server: Stopping IPC Server Responder
    03:15:10.890 INFO org.apache.hadoop.ipc.Server: Stopping IPC Server listener on 8019
    03:15:10.890 INFO org.apache.hadoop.ha.ActiveStandbyElector: Yielding from election
    03:15:10.891 INFO org.apache.hadoop.ha.HealthMonitor: Stopping HealthMonitor thread

    What am I doing wrong?
    On Thursday, April 25, 2013 at 0:23:45 UTC+4, bc Wong wrote:
    That's your problem. 19 is way way way too many, and you're getting hit by
    intra-ZK sync. For a normal cluster, use 3. Or 5 if you really need the
    redundancy. The largest ZK quorum that I've heard of is 9, supporting
    thousands of client machines.
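
    For illustration, the quorum size is simply the number of server.N entries
    every member shares in zoo.cfg; a rough sketch of what a 5-node ensemble looks
    like (hostnames and the config path are placeholders, and under Cloudera
    Manager this list is generated from the assigned ZooKeeper roles):

    # List the ensemble members defined on a ZK host; with 5 entries the quorum
    # tolerates 2 failed servers, with 3 entries it tolerates 1.
    grep '^server\.' /etc/zookeeper/conf/zoo.cfg
    # server.1=zk1.example.com:2888:3888
    # server.2=zk2.example.com:2888:3888
    # server.3=zk3.example.com:2888:3888
    # server.4=zk4.example.com:2888:3888
    # server.5=zk5.example.com:2888:3888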


    On Wed, Apr 24, 2013 at 12:48 PM, Serega Sheypak <serega....@gmail.com> wrote:
    There are 19 zookeepers.
    I restarted it yesterday. Two hard resets in the past two days.


    2013/4/24 bc Wong <bcwa...@cloudera.com>
    How many ZK servers do you have? (Please make sure that you have 3, or
    5.)

    This is a connection problem between the failover controller and ZK. You
    may find more info from the ZK log. I'd suggest that you restart the bad
    failover controller and see if that fixes it.

    Cheers,
    bc



