Hbase Master Failover Issue
HBase user mailing list (Grokbase), May 2011
Hi,

I'm testing failover from one master to another by stopping master1
(master2 is always running). Master2's web interface kicks in and I can
run zk_dump, but the region servers never show up. The logs on master2 show the
repeated entries below:

2011-05-05 09:10:05,938 INFO org.apache.hadoop.hbase.master.ServerManager:
Waiting on regionserver(s) to checkin
2011-05-05 09:10:07,440 INFO org.apache.hadoop.hbase.master.ServerManager:
Waiting on regionserver(s) to checkin

Obviously the region servers are not checking in; I'm not sure why.

Any ideas?

thx,

--
Sean Barden
sbarden@gmail.com


  • Jean-Daniel Cryans at May 5, 2011 at 4:49 pm
    This sounds like https://issues.apache.org/jira/browse/HBASE-3545,
    which was fixed in 0.90.2. Which version are you testing?

    J-D
  • Sean barden at May 5, 2011 at 4:56 pm
    That looks like my issue. We're using 0.90.1-CDH3B4, so it looks like
    an upgrade is in order. Can you suggest a workaround?

    thx,

    Sean
  • Jean-Daniel Cryans at May 5, 2011 at 4:59 pm
    Upgrade to CDH3u0, which as far as I can tell includes the fix:
    http://archive.cloudera.com/cdh/3/hbase-0.90.1+15.18.CHANGES.txt

    J-D
  • Sean barden at May 13, 2011 at 7:09 pm
    So I updated one of my clusters from CDHb1 to u0 with no issues in the
    upgrade itself. HBase failed over to its "backup" master server just
    fine on the older version. Since u0 is 0.90.1+15.18, I had hoped it
    would include the fix for the failover issue; however, I'm having the
    same problem. When master1 fails or I shut it down, master2 waits
    forever for the region servers to check in. Restarting the services
    for master2 and all the region servers does nothing until I start
    master1 back up. So, essentially, I have no failover for a critical
    component of my infrastructure. Needless to say, I'm exceptionally
    frustrated. Any ideas for a fix or workaround would be greatly
    appreciated.

    Regards,

    Sean
  • Jean-Daniel Cryans at May 13, 2011 at 7:35 pm
    Maybe there is something else going on in there; it would be useful to
    see the logs from the region servers while you shut down master1 and
    bring up master2.

    About "I have no failover for a critical component of my
    infrastructure": the same is true of the Namenode, and for the moment
    you can't do much about that. What's usually recommended is to put the
    master and the NN together on a more reliable machine. And the master
    isn't that critical; almost everything keeps working without it.

    J-D
  • Dmitriy Lyubimov at May 13, 2011 at 10:39 pm
    Thanks, Jean-Daniel.

    The logs don't show anything abnormal (not even warnings). How soon do
    you think the region servers should join?

    I'm guessing the sequence goes something like this: ZooKeeper needs to
    time out the old master's session first (2 minutes or so), then the
    hot spare should win the next master election (we should see that
    happening if we tail its log, right?), and then the rest of the
    cluster should join in, on a cadence that seems to be governed by the
    hbase.regionserver.msginterval property, if I read the code correctly.

    So, all in all, something like 3 minutes should be enough for
    everybody to find the new master one way or another, right? If not,
    we have a problem, right?

    Thanks.
    -Dmitriy
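
    The knobs in that sequence live in hbase-site.xml. A minimal sketch,
    assuming the 0.90-era property names Dmitriy cites (the values shown
    are illustrative, not recommendations):

        <!-- hbase-site.xml: settings that pace master failover -->
        <property>
          <!-- how long ZooKeeper waits before expiring a dead master's
               session and letting the hot spare win the election, in ms -->
          <name>zookeeper.session.timeout</name>
          <value>180000</value>
        </property>
        <property>
          <!-- how often each region server reports in to the master, in ms -->
          <name>hbase.regionserver.msginterval</name>
          <value>3000</value>
        </property>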

  • Dmitriy Lyubimov at May 14, 2011 at 12:33 am
    OK, the problem seems to be multi-NIC hosting on the masters. The
    HBase master starts up and listens on the address given by the
    canonical hostname, which points to the wrong NIC. I'm not sure why it
    is set up that way, so I'm not changing it, but I'm struggling to
    override it at the moment, as nothing seems to work
    (master.dns.interface=eth2, master.dns.server=ip2 ... I've tried all
    possible combinations). It probably has something to do with reverse
    lookup, so I added an entry to the hosts file, to no avail so far. I
    will have to talk to our admins to see why we can't point the
    canonical hostname at the IP that all the nodes are supposed to use.

    thanks.
    -d
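
    One note on the property names tried above: in 0.90 they are fully
    qualified as hbase.master.dns.interface and hbase.master.dns.nameserver,
    which may be why the shorter forms had no effect. A minimal
    hbase-site.xml sketch of that override (eth2 is the interface from the
    attempt above; the nameserver value is a placeholder):

        <!-- hbase-site.xml: pin the master's hostname lookup to one NIC -->
        <property>
          <!-- interface whose address the master should resolve and use -->
          <name>hbase.master.dns.interface</name>
          <value>eth2</value>
        </property>
        <property>
          <!-- name server to consult for that lookup; placeholder value -->
          <name>hbase.master.dns.nameserver</name>
          <value>ns1.example.com</value>
        </property>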
  • Dmitriy Lyubimov at May 14, 2011 at 1:18 am
    OK, I think the issue is largely solved. Thanks for your help, guys.

    -d
  • Stack at May 14, 2011 at 10:05 pm
    What did you do to solve it?
    Thanks,
    St.Ack
  • Gaojinchao at May 15, 2011 at 1:21 am
    I have seen this issue too, but in my case it was garbage data, and I
    deleted it. Linux supports multiple host entries in the hosts file. I
    want to know what you will do?

  • Dmitriy Lyubimov at May 15, 2011 at 7:50 pm
    The problem was the multi-NIC configuration on the master nodes.

    I saw that the process starts listening on the wrong NIC.

    I read the source code and saw that, with the default settings, it
    uses whatever IP is reported for the canonical hostname, i.e.,
    whatever is returned by something like

    ping `hostname`

    Our canonical hostname was, of course, resolving to the wrong NIC.

    I kind of did not want to edit /etc/hosts (I guessed our admins had a
    reason to point the hostname at that NIC), so I forcibly set 'eth0' as
    hbase.master.dns.interface (if I remember the property name correctly).

    It started listening on the address pointed to by eth0:0 instead of
    eth0, which solved the problem anyway.

    (Funny thing, though: I still couldn't make it listen on the eth0 IP,
    only on eth0:0, although both had reverse DNS. Apparently whatever
    native code is used lists both IPs for the interface and then takes
    the first one that has reverse DNS, so there is no way to force it to
    listen on the other ones.)

    Bottom line: with multi-NIC configurations, your hostname had better
    point, in /etc/hosts, to the IP you want to listen on. If it doesn't,
    you cannot use the default configuration.

    -d
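
    A minimal hbase-site.xml sketch of the fix described above (eth0 comes
    straight from the report; as noted, which of the interface's addresses
    is actually picked depends on which one has reverse DNS):

        <!-- hbase-site.xml: derive the master's hostname from eth0 -->
        <property>
          <name>hbase.master.dns.interface</name>
          <value>eth0</value>
        </property>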
  • Jean-Daniel Cryans at May 16, 2011 at 6:59 pm
    Hey Dmitriy,

    Awesome that you figured it out. I wonder if there's something that
    could be done in HBase to help debug such problems... Suggestions?

    Also, just to make sure: this thread was started by Sean, and it seems
    you stepped up for him... you are working together, right? At least
    that's what Rapportive tells me, but I'm still trying to make sure we
    didn't forget someone else's problem.

    Good on you,

    J-D
  • Sean barden at May 16, 2011 at 7:45 pm
    Dima and I work together. He's got a good amount of open-source
    experience on me, and I got pulled away to work on something else
    (MS-SQL issues, no less). He gets all the fun. :) Seriously, the issue
    wouldn't have been solved without him stepping up. Thanks, Dima!


    sean

Discussion Overview
group: user@hbase.apache.org
categories: hbase, hadoop
posted: May 5, '11 at 4:23p
active: May 16, '11 at 7:45p
posts: 15
users: 5
website: hbase.apache.org
