Hi,

In production we see temporary networking failures (we are not 100% sure
what they are). Now and then a region server's ZooKeeper session expires,
and in addition some IPC channels throw 'channel closed'.

This causes the region server to exit, which is not a very big deal: our
monitoring system sends a text message so somebody can restart the
region server.

However, this happens a little more often than we would like to handle
manually.

Why is the server not recovering/reconnecting automatically? Is there a
facility to enable server restarts and let region server nodes rejoin the
cluster automatically?

Thanks.
-Dmitriy


  • Ryan Rawson at Sep 22, 2010 at 12:24 am
    You could wrap the regionserver in a script that automatically restarts it?

    We can't really recover from this scenario, because the master notices
    we are dead, then splits our logs and reassigns the regions to other
    nodes. This is the basis of how HBase stays reliable in the face of
    machine failure.

    -ryan
  • Dmitriy Lyubimov at Sep 22, 2010 at 12:30 am
    Thanks a lot, Ryan.

    That's what I thought. I knew the explanation that the logs are split and
    the regions reassigned; still, I guess one could argue there's no reason
    we can't try to start a new life by rejoining the cluster as a new region
    server (but in the same process), or at least have that as an option.
    Just wanted to double-check before wrapping it in some sort of a kicker.
    -Dmitriy

  • Ryan Rawson at Sep 22, 2010 at 12:31 am
    We tried that before, but some things are difficult to reset in the same JVM.

    A clean restart just works better :-)
  • Matthew LeMieux at Sep 22, 2010 at 12:37 am
    What are the JVM limitations that you were running into?

    -Matthew
  • Ryan Rawson at Sep 22, 2010 at 12:39 am
    No JVM limitations, but some code is just not really meant to be
    restarted within the same JVM, and things just didn't work out well.
    Specifically the DFSClient code, and I think we had to hack a bunch to
    make the ZK sessions reconnect, because you have to re-init the entire
    stack.

    When you have a bunch of code that assumes a static gets initialized
    once and never again, that doesn't make for an easy reinitialize.
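
For anyone landing on this thread later, here is a rough sketch of the kind of wrapper ("kicker") discussed above. It is not an official HBase tool, just one way to do it, assuming the region server can be run as a foreground command that blocks until the process exits. The function name `supervise`, the restart limit, and the backoff values are all made up for illustration. Each restart is a fresh JVM, which matches the advice that a clean restart works better than re-initializing in the same process.

```python
import subprocess
import time


def supervise(cmd, max_restarts=5, base_delay=5.0, max_delay=300.0,
              run=subprocess.call, sleep=time.sleep):
    """Run cmd in the foreground and start it again every time it exits.

    Returns the list of exit codes seen. Gives up after max_restarts
    restarts so a server that dies on startup cannot loop forever.
    run and sleep are injectable so the loop can be tested without
    launching a real process.
    """
    exit_codes = []
    delay = base_delay
    for attempt in range(max_restarts + 1):
        code = run(cmd)  # blocks until the child process exits
        exit_codes.append(code)
        if attempt == max_restarts:
            break  # restart budget used up, leave it to a human
        sleep(delay)  # back off before starting a fresh process
        delay = min(delay * 2, max_delay)  # exponential backoff, capped
    return exit_codes
```

Usage would be something like `supervise(["/path/to/hbase/bin/hbase", "regionserver", "start"])`, where the path is a placeholder for your own install. The backoff is there so a flapping network does not turn into a tight restart loop, and the restart cap means the monitoring alert still fires when the problem is not transient.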

Discussion Overview
group: user @
categories: hbase, hadoop
posted: Sep 22, '10 at 12:21a
active: Sep 22, '10 at 12:39a
posts: 6
users: 3
website: hbase.apache.org
