Hi guys,

We are testing with Storm 0.8.1 and seeing weird ZooKeeper crashes/timeouts when
running Storm in local mode. We turned up DEBUG logging, and while Storm is
spinning up spouts and bolts, we start to see connection failures to ZK:

[2012-12-04 09:22:34,546] SyncThread:0 server.FinalRequestProcessor DEBUG
sessionid:0x13b66f0c4eb000b type:exists cxid:0x27 zxid:0xfffffffffffffffe
txntype:unknown reqpath:/storm/errors
[2012-12-04 09:22:34,546] Thread-137-SendThread(localhost:2000)
zookeeper.ClientCnxn DEBUG Reading reply sessionid:0x13b66f0c4eb000b,
packet:: clientPath:null serverPath:null finished:false header:: 39,3
replyHeader:: 39,32,0 request:: '/storm/errors,F response::
s{9,9,1354641753361,1354641753361,0,0,0,0,1,0,9}
[2012-12-04 09:22:34,548] Thread-137 daemon.executor INFO Loading executor
concurrent_provider:[6 6]
[2012-12-04 09:22:34,549] Thread-137 daemon.task INFO Emitting:
concurrent_provider __system ["startup"]
[2012-12-04 09:22:34,549] Thread-137 daemon.executor INFO Loaded executor
tasks concurrent_provider:[6 6]
[2012-12-04 09:22:34,550] Thread-137 daemon.executor INFO Preparing bolt
concurrent_provider:(6)
[2012-12-04 09:22:34,551] pool-156-thread-1 thrift.ThriftConverter DEBUG
java.net.ConnectException: Connection refused
[2012-12-04 09:22:34,696] Thread-42 daemon.worker DEBUG Doing heartbeat
#backtype.storm.daemon.common.WorkerHeartbeat{:time-secs 1354641754,
:storm-id "huginn_viewers-1-1354641722", :executors #{[2 2] [3 3] [4 4] [5
5] [6 6] [7 7] [8 8] [9 9] [10 10] [11 11] [12 12] [13 13] [14 14] [15 15]
[16 16] [17 17] [18 18] [19 19] [20 20] [21 21] [1 1]}, :port 1}
[2012-12-04 09:22:34,805] RetryService : 127.0.0.1(127.0.0.1):9160
thrift.ThriftConverter DEBUG java.net.ConnectException: Connection refused
[2012-12-04 09:22:34,806] Thread-38-SendThread(localhost:2000)
zookeeper.ClientCnxn INFO Opening socket connection to server
localhost/0:0:0:0:0:0:0:1:2000
[2012-12-04 09:22:34,806] Thread-38-SendThread(localhost:2000)
zookeeper.ClientCnxn INFO Socket connection established to
localhost/0:0:0:0:0:0:0:1:2000, initiating session
[2012-12-04 09:22:34,806] NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2000
server.NIOServerCnxn INFO Accepted socket connection from
/0:0:0:0:0:0:0:1%0:50026
[2012-12-04 09:22:34,806] Thread-38-SendThread(localhost:2000)
zookeeper.ClientCnxn DEBUG Session establishment request sent on
localhost/0:0:0:0:0:0:0:1:2000
[2012-12-04 09:22:34,807] NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2000
server.NIOServerCnxn DEBUG Session establishment request from client
/0:0:0:0:0:0:0:1%0:50026 client's lastZxid is 0x28
[2012-12-04 09:22:34,807] NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2000
server.NIOServerCnxn INFO Refusing session request for client
/0:0:0:0:0:0:0:1%0:50026 as it has seen zxid 0x28 our last zxid is 0x20
client must try another server
[2012-12-04 09:22:34,807] NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2000
server.NIOServerCnxn INFO Closed socket connection for client
/0:0:0:0:0:0:0:1%0:50026 (no session established for client)
[2012-12-04 09:22:34,807] Thread-38-SendThread(localhost:2000)
zookeeper.ClientCnxn INFO Unable to read additional data from server
sessionid 0x13b66f04a74000b, likely server has closed socket, closing
socket connection and attempting reconnect
[2012-12-04 09:22:34,807] Thread-38-SendThread(localhost:2000)
zookeeper.ClientCnxn DEBUG Ignoring exception during shutdown input
java.net.SocketException: Socket is not connected
at sun.nio.ch.SocketChannelImpl.shutdown(Native Method)
at
sun.nio.ch.SocketChannelImpl.shutdownInput(SocketChannelImpl.java:665)
at sun.nio.ch.SocketAdaptor.shutdownInput(SocketAdaptor.java:430)
at
org.apache.zookeeper.ClientCnxn$SendThread.cleanup(ClientCnxn.java:1205)
at
org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1170)
[2012-12-04 09:22:34,870] RetryService : 127.0.0.1(127.0.0.1):9160
thrift.ThriftConverter DEBUG java.net.ConnectException: Connection refused
[2012-12-04 09:22:34,933] RetryService : 127.0.0.1(127.0.0.1):9160
thrift.ThriftConverter DEBUG java.net.ConnectException: Connection refused

This continues for a while: some threads keep making ZK connections and
communicating, while other threads fail with ZK connection errors.
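
For context, we run the topology in local mode roughly like the sketch below; the
spout, class names, and parallelism here are simplified placeholders, not our
actual code:

import java.util.Map;

import backtype.storm.Config;
import backtype.storm.LocalCluster;
import backtype.storm.spout.SpoutOutputCollector;
import backtype.storm.task.TopologyContext;
import backtype.storm.topology.OutputFieldsDeclarer;
import backtype.storm.topology.TopologyBuilder;
import backtype.storm.topology.base.BaseRichSpout;
import backtype.storm.tuple.Fields;
import backtype.storm.tuple.Values;
import backtype.storm.utils.Utils;

public class LocalModeRepro {

    // Trivial stand-in spout; the real topology wires up concurrent_provider
    // and roughly 20 other executors.
    public static class DummySpout extends BaseRichSpout {
        private SpoutOutputCollector collector;

        public void open(Map conf, TopologyContext context, SpoutOutputCollector collector) {
            this.collector = collector;
        }

        public void nextTuple() {
            Utils.sleep(100);
            collector.emit(new Values("tick"));
        }

        public void declareOutputFields(OutputFieldsDeclarer declarer) {
            declarer.declare(new Fields("word"));
        }
    }

    public static void main(String[] args) {
        TopologyBuilder builder = new TopologyBuilder();
        builder.setSpout("dummy", new DummySpout(), 1);

        Config conf = new Config();
        conf.setNumWorkers(4);  // local mode simulates these workers in-process

        // LocalCluster starts an in-process ZooKeeper (port 2000 by default)
        // plus Nimbus, a supervisor, and the workers inside the test JVM.
        LocalCluster cluster = new LocalCluster();
        cluster.submitTopology("huginn_viewers", conf, builder.createTopology());

        Utils.sleep(60 * 1000);
        cluster.shutdown();
    }
}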

Soon the retries are exhausted and you start seeing stack traces:

[2012-12-04 09:22:39,067] Thread-77 curator.ConnectionState ERROR
Connection timed out
org.apache.zookeeper.KeeperException$ConnectionLossException:
KeeperErrorCode = ConnectionLoss
at
com.netflix.curator.ConnectionState.getZooKeeper(ConnectionState.java:72)
at
com.netflix.curator.CuratorZookeeperClient.getZooKeeper(CuratorZookeeperClient.java:74)
at
com.netflix.curator.framework.imps.CuratorFrameworkImpl.getZooKeeper(CuratorFrameworkImpl.java:353)
at
com.netflix.curator.framework.imps.ExistsBuilderImpl$2.call(ExistsBuilderImpl.java:149)
at
com.netflix.curator.framework.imps.ExistsBuilderImpl$2.call(ExistsBuilderImpl.java:138)
at com.netflix.curator.RetryLoop.callWithRetry(RetryLoop.java:85)
at
com.netflix.curator.framework.imps.ExistsBuilderImpl.pathInForeground(ExistsBuilderImpl.java:134)
at
com.netflix.curator.framework.imps.ExistsBuilderImpl.forPath(ExistsBuilderImpl.java:125)
at
com.netflix.curator.framework.imps.ExistsBuilderImpl.forPath(ExistsBuilderImpl.java:34)
at
backtype.storm.zookeeper$exists_node_QMARK_.invoke(zookeeper.clj:78)
at backtype.storm.zookeeper$exists.invoke(zookeeper.clj:117)
at
backtype.storm.cluster$mk_distributed_cluster_state$reify__1973.set_data(cluster.clj:70)
at
backtype.storm.cluster$mk_storm_cluster_state$reify__2392.worker_heartbeat_BANG_(cluster.clj:276)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at clojure.lang.Reflector.invokeMatchingMethod(Reflector.java:93)
at clojure.lang.Reflector.invokeInstanceMethod(Reflector.java:28)
at
backtype.storm.daemon.worker$do_executor_heartbeats.doInvoke(worker.clj:35)
at clojure.lang.RestFn.invoke(RestFn.java:439)
at
backtype.storm.daemon.worker$fn__4324$exec_fn__1207__auto____4325$fn__4328.invoke(worker.clj:345)
at
backtype.storm.timer$schedule_recurring$this__1753.invoke(timer.clj:69)
at
backtype.storm.timer$mk_timer$fn__1736$fn__1737.invoke(timer.clj:33)
at backtype.storm.timer$mk_timer$fn__1736.invoke(timer.clj:26)
at clojure.lang.AFn.run(AFn.java:24)
at java.lang.Thread.run(Thread.java:680)
[2012-12-04 09:22:39,068] Thread-77 curator.RetryLoop DEBUG Retry-able
exception received
org.apache.zookeeper.KeeperException$ConnectionLossException:
KeeperErrorCode = ConnectionLoss
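
We've also been wondering whether simply loosening the ZK client retry/timeout
settings would buy some headroom, along the lines of the sketch below. These
keys are defined in backtype.storm.Config, but whether LocalCluster honors all
of them for the in-process ZK is our assumption:

import backtype.storm.Config;

public class ZkClientTuning {
    // Returns a Config with looser ZooKeeper client retry/timeout settings
    // than the defaults, to be passed to submitTopology(...).
    public static Config looseZkConf() {
        Config conf = new Config();
        conf.put(Config.STORM_ZOOKEEPER_RETRY_TIMES, 10);        // more Curator retries
        conf.put(Config.STORM_ZOOKEEPER_RETRY_INTERVAL, 2000);   // ms between retries
        conf.put(Config.STORM_ZOOKEEPER_SESSION_TIMEOUT, 30000); // ms session timeout
        return conf;
    }
}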

Soon the Nimbus master shuts down:

[2012-12-04 09:22:43,428] Thread-134 daemon.supervisor DEBUG Assigned
executors: {}
[2012-12-04 09:22:43,428] Thread-134 daemon.supervisor DEBUG Allocated: {}
[2012-12-04 09:22:43,506] Thread-138 daemon.worker DEBUG Doing heartbeat
#backtype.storm.daemon.common.WorkerHeartbeat{:time-secs 1354641763,
:storm-id "huginn_viewers-1-1354641753"
, :executors #{[2 2] [3 3] [4 4] [5 5] [6 6] [7 7] [8 8] [9 9] [10 10] [11
11] [12 12] [13 13] [14 14] [15 15] [16 16] [17 17] [18 18] [19 19] [20 20]
[21 21] [1 1]}, :port 4}
[2012-12-04 09:22:43,531] pool-8-thread-6 daemon.nimbus INFO Shutting down
master
[2012-12-04 09:22:43,532] pool-8-thread-6 imps.CuratorFrameworkImpl DEBUG
Closing
[2012-12-04 09:22:43,532] pool-8-thread-6 curator.CuratorZookeeperClient
DEBUG Closing

The funny thing is that this was working for us just recently (on 0.8.0), but
then it started failing for some of us, and upgrading to 0.8.1 didn't fix it.

It's as though ZK is running out of connections.
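
One way we thought of to sanity-check that is to ask the in-process ZK server
for its open connections using the "cons" four-letter command (port 2000 is
where local mode's ZK listens, per the logs above). A rough sketch:

import java.io.InputStream;
import java.io.OutputStream;
import java.net.Socket;

public class ZkConnDump {
    public static void main(String[] args) throws Exception {
        // "cons" lists ZooKeeper's open client connections; "stat" gives a summary.
        Socket sock = new Socket("localhost", 2000);
        try {
            OutputStream out = sock.getOutputStream();
            out.write("cons".getBytes());
            out.flush();
            InputStream in = sock.getInputStream();
            byte[] buf = new byte[4096];
            int n;
            while ((n = in.read(buf)) != -1) {
                System.out.write(buf, 0, n);
            }
            System.out.flush();
        } finally {
            sock.close();
        }
    }
}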

Any ideas?

thanks,
Evan

  • Nathan Marz at Dec 6, 2012 at 9:53 pm
    Did any of your dependencies change? Are you overriding the ZK dependency,
    for instance?
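
    If you want to double-check that, a snippet along these lines will print which
    ZooKeeper client actually ended up on your test classpath (illustrative only;
    the class name is made up):

    import org.apache.zookeeper.ZooKeeper;

    public class WhichZk {
        public static void main(String[] args) {
            // Version string of the ZooKeeper client classes that were loaded...
            System.out.println(org.apache.zookeeper.Version.getFullVersion());
            // ...and the jar they came from, to spot an overridden dependency.
            System.out.println(ZooKeeper.class.getProtectionDomain()
                                        .getCodeSource().getLocation());
        }
    }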

    --
    Twitter: @nathanmarz
    http://nathanmarz.com
  • Evan Chan at Dec 6, 2012 at 10:24 pm
    Please ignore this; we solved the problem internally, but for some reason the
    post was re-sent.

    --
    *Evan Chan*
    Senior Software Engineer | ev@ooyala.com | (650) 996-4600
    www.ooyala.com | blog <http://www.ooyala.com/blog> | @ooyala <http://www.twitter.com/ooyala>
