I replaced the hbase jar with hbase-0.90.1.jar.
I also upgraded the client-side jar to hbase-0.90.1.jar.

Our map tasks ran faster than before for about 50 minutes. However,
map tasks then timed out calling flushCommits(). This happened even after
a fresh restart of HBase.

I don't see any exception in region server logs.

In master log, I found:

2011-02-10 18:24:15,286 DEBUG org.apache.hadoop.hbase.master.handler.OpenedRegionHandler: Opened region -ROOT-,,0.70236052 on sjc1-hadoop6.X.com,60020,1297362251595
2011-02-10 18:24:15,349 INFO org.apache.hadoop.hbase.catalog.CatalogTracker: Failed verification of .META.,,1 at address=null; org.apache.hadoop.hbase.NotServingRegionException: org.apache.hadoop.hbase.NotServingRegionException: Region is not online: .META.,,1
2011-02-10 18:24:15,350 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: master:60000-0x12e10d0e31e0000 Creating (or updating) unassigned node for 1028785192 with OFFLINE state

I am attaching the jstack of the region server that didn't respond to stop-hbase.sh.

FYI
On Thu, Feb 10, 2011 at 10:10 AM, Stack wrote:

That's probably enough, Ted. The 0.90.1 hbase-default.xml has an extra
config to enable the experimental HBASE-3455 feature, but you can copy
that over if you want to try playing with it (it defaults to off, so you'd
copy over the config only if you wanted to set it to true).

St.Ack

  • Ted Yu at Feb 10, 2011 at 10:44 pm
    FYI, I use cdh3b2. I put the hadoop jar from cdh3b2 into $HBASE_HOME/lib.
  • Stack at Feb 10, 2011 at 11:04 pm
    So, .META. is not online? What happens if you use the shell at this time?

    Your attachment did not come across, Ted. Mind pastebin'ing it?

    St.Ack
  • Ted Yu at Feb 10, 2011 at 11:13 pm
    .META. went offline during the second flow attempt.

    The timeout I mentioned happened on the 1st and 3rd attempts. HBase was
    restarted before the 1st and 3rd attempts.

    Here is jstack:
    http://pastebin.com/EHMSvsRt
  • Ryan Rawson at Feb 10, 2011 at 11:17 pm
    You don't have both the old and the new hbase jars in there, do you?

    -ryan
  • Ted Yu at Feb 10, 2011 at 11:26 pm
    hbase-0.90.0-tests.jar and hbase-0.90.1.jar co-exist.
    Would this be a problem?
  • Ryan Rawson at Feb 10, 2011 at 11:40 pm
    What do you get when you run:

    ls lib/hbase*

    I'm going to guess there is an hbase-0.90.0.jar there.


  • Ted Yu at Feb 11, 2011 at 12:21 am
    hbase/hbase-0.90.1.jar leads lib/hbase-0.90.0.jar in the classpath.
    I wonder:
    1. why the hbase jar is placed in two directories (0.20.6 didn't use such a
       structure)
    2. what from lib/hbase-0.90.0.jar could be picked up, and why there wasn't
       an exception in the server log

    I think a JIRA should be filed for item 2 above: bail out when the two
    hbase jars from $HBASE_HOME and $HBASE_HOME/lib are of different versions.

    Cheers
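
The version check proposed above does not have to live inside HBase; it can run as a script before start-up. The following is a minimal sketch of that idea, not something HBase ships. The function names are hypothetical, and it assumes core jars are named hbase-<version>.jar and sit either in $HBASE_HOME or in $HBASE_HOME/lib, as in this thread. The -tests jar is deliberately ignored.

```python
import re
from collections import defaultdict
from pathlib import Path

# Assumed naming scheme: hbase-<version>.jar. This does not match
# hbase-<version>-tests.jar, which is intentionally left out of the check.
JAR_RE = re.compile(r"^hbase-(\d+(?:\.\d+)*)\.jar$")


def find_hbase_jars(hbase_home):
    """Map each HBase version to the core jars found in hbase_home and hbase_home/lib."""
    home = Path(hbase_home)
    found = defaultdict(list)
    for directory in (home, home / "lib"):
        if not directory.is_dir():
            continue
        for jar in sorted(directory.glob("hbase-*.jar")):
            match = JAR_RE.match(jar.name)
            if match:
                found[match.group(1)].append(jar)
    return dict(found)


def check_single_version(hbase_home):
    """Raise RuntimeError if core HBase jars of more than one version are present."""
    found = find_hbase_jars(hbase_home)
    if len(found) > 1:
        details = "; ".join(
            f"{version}: {', '.join(str(p) for p in paths)}"
            for version, paths in sorted(found.items())
        )
        raise RuntimeError(f"mixed HBase jar versions: {details}")
    return found
```

Calling check_single_version on the HBase install directory before starting the daemons would have flagged the 0.90.0 / 0.90.1 mix described here.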
  • Ryan Rawson at Feb 11, 2011 at 12:25 am
    As I suspected.

    It's a byproduct of our maven assembly process. The process could be
    fixed; I wouldn't mind. I don't support runtime checking of jars:
    there is such a thing as too many tests, and this is an example of it.
    The check would then need a test, etc., etc.

    At SU we use a new directory for each upgrade, copying the config
    over. With the lack of -default.xml this is easier than ever (just
    copy everything in conf/). With a symlink pointing at the active
    directory, rolling forward or back is as simple as repointing the
    symlink. I recommend this to everyone who doesn't have a management
    scheme.
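
The directory-per-release scheme described above can be sketched roughly as follows. This is an illustration only, not the actual tooling used at SU. The function names, the releases/ layout and the "current" symlink name are all assumptions, and dirs_exist_ok needs Python 3.8 or later.

```python
import os
import shutil
from pathlib import Path


def install_release(releases_dir, version, unpacked_dir, previous_conf=None):
    """Copy an unpacked release into its own directory and carry conf/ over."""
    target = Path(releases_dir) / f"hbase-{version}"
    shutil.copytree(unpacked_dir, target)
    if previous_conf is not None:
        # Overlay the previous configuration on top of the shipped conf/.
        shutil.copytree(previous_conf, target / "conf", dirs_exist_ok=True)
    return target


def switch_current(link_path, target):
    """Repoint the symlink at link_path to target via a rename (atomic on POSIX)."""
    link = Path(link_path)
    tmp = link.with_name(link.name + ".tmp")
    if tmp.is_symlink() or tmp.exists():
        tmp.unlink()
    os.symlink(target, tmp, target_is_directory=True)
    # rename(2) replaces the old symlink itself, never the directory it points to.
    os.replace(tmp, link)
```

Because each release keeps its own lib/ directory, an old hbase jar cannot linger next to a new one, and rolling back is a second switch_current call pointing at the previous directory.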
  • Ted Yu at Feb 11, 2011 at 12:34 am
    Can someone comment on my second question?
    Thanks
  • Ryan Rawson at Feb 11, 2011 at 12:43 am
    It's a standard linking issue: you get one class from one version and
    another from another. They are mostly compatible in terms of
    signatures (hence no exceptions) but are subtly incompatible in
    other ways. In the stack trace you posted, the handlers were
    blocked in:

    at org.apache.hadoop.hbase.regionserver.MemStoreFlusher.reclaimMemStoreMemory(MemStoreFlusher.java:382)

    and the thread:

    "regionserver60020.cacheFlusher" daemon prio=10 tid=0x00002aaabc21e000 nid=0x7717 waiting for monitor entry [0x0000000000000000]
    java.lang.Thread.State: BLOCKED (on object monitor)

    was idle.

    The cache flusher thread should be flushing, and yet it's doing
    nothing. This also happens to be one of the classes that were
    changed.
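
The kind of reading of a jstack dump done above (which threads are BLOCKED, and in which frame) can be partly automated. Below is a rough sketch that assumes the usual HotSpot jstack text layout: a quoted thread name on a header line, then a java.lang.Thread.State line, then "at ..." frames. The function names are made up for illustration.

```python
import re
from collections import Counter

THREAD_RE = re.compile(r'^"(?P<name>[^"]+)"')
STATE_RE = re.compile(r"java\.lang\.Thread\.State: (?P<state>[A-Z_]+)")
FRAME_RE = re.compile(r"^\s*at (?P<frame>[\w.$<>]+)\(")


def parse_jstack(text):
    """Return one dict per thread with its name, state and stack frames."""
    threads = []
    current = None
    for line in text.splitlines():
        header = THREAD_RE.match(line)
        if header:
            current = {"name": header.group("name"), "state": None, "frames": []}
            threads.append(current)
            continue
        if current is None:
            continue  # preamble before the first thread header
        state = STATE_RE.search(line)
        if state:
            current["state"] = state.group("state")
            continue
        frame = FRAME_RE.match(line)
        if frame:
            current["frames"].append(frame.group("frame"))
    return threads


def top_frames_by_state(threads, state):
    """Count the innermost frame of every thread that is in the given state."""
    return Counter(
        t["frames"][0] if t["frames"] else "<no frames>"
        for t in threads
        if t["state"] == state
    )
```

Running top_frames_by_state(parse_jstack(dump), "BLOCKED") on a dump like the one discussed here should make a pile-up of handlers in a single method easy to spot.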


  • Ted Yu at Feb 11, 2011 at 12:54 am
    Thanks for the explanation.
    Assuming the mixed class loading is static, why did this situation develop
    only after 40 minutes of heavy load? :-(
    On Thu, Feb 10, 2011 at 4:42 PM, Ryan Rawson wrote:

    It's a standard linking issue, you get one class from one version
    another from another, they are mostly compatible in terms of
    signatures (hence no exceptions) but are subtly incompatible in
    different ways. In the stack trace you posted, the handlers were
    blocked in:

    at
    org.apache.hadoop.hbase.regionserver.MemStoreFlusher.reclaimMemStoreMemory(MemStoreFlusher.java:382)

    and the thread:

    "regionserver60020.cacheFlusher" daemon prio=10 tid=0x00002aaabc21e000
    nid=0x7717 waiting for monitor entry [0x0000000000000000]
    java.lang.Thread.State: BLOCKED (on object monitor)

    was idle.

    The cache flusher thread should be flushing, and yet it's doing
    nothing. This also happens to be one of the classes that were
    changed.


    On Thu, Feb 10, 2011 at 4:34 PM, Ted Yu wrote:
    Can someone comment on my second question ?
    Thanks
    On Thu, Feb 10, 2011 at 4:25 PM, Ryan Rawson wrote:

    As I suspected.

    It's a byproduct of our maven assembly process. The process could be
    fixed. I wouldn't mind. I don't support runtime checking of jars,
    there is such thing as too much tests, and this is an example of it.
    The check would then need a test, etc, etc.

    At SU we use new directories for each upgrade, copying the config
    over. With the lack of -default.xml this is easier than ever (just
    copy everything in conf/). With symlink switchover it makes roll
    forward/back as simple as doing a symlink switchover or back. I have
    to recommend this to everyone who doesnt have a management scheme.
    On Thu, Feb 10, 2011 at 4:20 PM, Ted Yu wrote:
    hbase/hbase-0.90.1.jar leads lib/hbase-0.90.0.jar in the classpath.
    I wonder
    1. why hbase jar is placed in two directories - 0.20.6 didn't use such
    structure
    2. what from lib/hbase-0.90.0.jar could be picked up and why there
    wasn't
    exception in server log

    I think a JIRA should be filed for item 2 above - bail out when the
    two
    hbase jars from $HBASE_HOME and $HBASE_HOME/lib are of different versions.
    Cheers
    On Thu, Feb 10, 2011 at 3:40 PM, Ryan Rawson wrote:

    What do you get when you:

    ls lib/hbase*

    I'm going to guess there is hbase-0.90.0.jar there


    On Thu, Feb 10, 2011 at 3:25 PM, Ted Yu wrote:
    hbase-0.90.0-tests.jar and hbase-0.90.1.jar co-exist
    Would this be a problem ?

    On Thu, Feb 10, 2011 at 3:16 PM, Ryan Rawson <ryanobjc@gmail.com> wrote:
    You don't have both the old and the new hbase jars in there, do you?
    -ryan
    On Thu, Feb 10, 2011 at 3:12 PM, Ted Yu wrote:
    .META. went offline during second flow attempt.

    The time out I mentioned happened for 1st and 3rd attempts.
    HBase was restarted before the 1st and 3rd attempts.

    Here is jstack:
    http://pastebin.com/EHMSvsRt
    On Thu, Feb 10, 2011 at 3:04 PM, Stack wrote:

    So, .META. is not online? What happens if you use shell at this time?
    Your attachment did not come across, Ted. Mind pastebin'ing it?

    St.Ack

  • Todd Lipcon at Feb 11, 2011 at 1:06 am

    On Thu, Feb 10, 2011 at 4:54 PM, Ted Yu wrote:

    Thanks for the explanation.
    Assuming the mixed class loading is static, why did this situation develop
    after 40 minutes of heavy load :-(

    You didn't hit global memstore pressure until 40 minutes of load.

    -Todd
    On Thu, Feb 10, 2011 at 4:42 PM, Ryan Rawson wrote:

    It's a standard linking issue: you get one class from one version and
    another from another. They are mostly compatible in terms of
    signatures (hence no exceptions) but are subtly incompatible in
    different ways. In the stack trace you posted, the handlers were
    blocked in:

    at org.apache.hadoop.hbase.regionserver.MemStoreFlusher.reclaimMemStoreMemory(MemStoreFlusher.java:382)

    and the thread:

    "regionserver60020.cacheFlusher" daemon prio=10 tid=0x00002aaabc21e000
    nid=0x7717 waiting for monitor entry [0x0000000000000000]
    java.lang.Thread.State: BLOCKED (on object monitor)

    was idle.

    The cache flusher thread should be flushing, and yet it's doing
    nothing. This also happens to be one of the classes that were
    changed.
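
The kind of reading done on the jstack output above can be partly automated. This is a minimal sketch assuming the usual HotSpot thread-dump layout (a quoted thread name line followed by a `java.lang.Thread.State:` line); the function name `summarize_threads` is hypothetical. It counts threads per state and lists the names of the BLOCKED ones.

```python
import re
from collections import Counter

# A thread header starts with the quoted thread name.
NAME_LINE = re.compile(r'^"([^"]+)"')
# The state line follows the header, e.g. "java.lang.Thread.State: BLOCKED (...)".
STATE_LINE = re.compile(r"java\.lang\.Thread\.State:\s+([A-Z_]+)")


def summarize_threads(dump_text):
    """Return (Counter of thread states, list of names of BLOCKED threads)."""
    states = Counter()
    blocked = []
    current = None
    for raw in dump_text.splitlines():
        line = raw.strip()
        name = NAME_LINE.match(line)
        if name:
            current = name.group(1)
            continue
        state = STATE_LINE.search(line)
        if state:
            states[state.group(1)] += 1
            if state.group(1) == "BLOCKED" and current is not None:
                blocked.append(current)
    return states, blocked
```

A dump in which `regionserver60020.cacheFlusher` shows up in the BLOCKED list together with the handler threads matches the situation described above, though the summary alone does not say which monitor they are waiting on.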


    On Thu, Feb 10, 2011 at 4:34 PM, Ted Yu wrote:
    Can someone comment on my second question ?
    Thanks
    On Thu, Feb 10, 2011 at 4:25 PM, Ryan Rawson wrote:

    As I suspected.

    It's a byproduct of our maven assembly process. The process could be
    fixed. I wouldn't mind. I don't support runtime checking of jars;
    there is such a thing as too many tests, and this is an example of it.
    The check would then need a test, etc, etc.

    At SU we use new directories for each upgrade, copying the config
    over. With the lack of -default.xml this is easier than ever (just
    copy everything in conf/). With a symlink switchover, rolling
    forward or back is as simple as repointing the symlink. I have
    to recommend this to everyone who doesn't have a management scheme.
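
The symlink switchover described above can be sketched as follows. This is an illustration under assumptions, not the scheme actually used at SU: the helper name `switch_release` and the directory names in the usage note are hypothetical, and the code relies on POSIX symlink and rename semantics.

```python
import os


def switch_release(link_path, release_dir):
    """Repoint the symlink at link_path to release_dir.

    A temporary link is created first and then renamed over the old one,
    so link_path never disappears while the switch happens.
    """
    tmp_link = link_path + ".tmp"
    if os.path.lexists(tmp_link):
        os.remove(tmp_link)  # clean up a leftover from an interrupted switch
    os.symlink(release_dir, tmp_link)
    os.replace(tmp_link, link_path)  # rename replaces the old link in one step
```

With each release unpacked into its own directory (for example `hbase-0.90.0` and `hbase-0.90.1`, each with its own `lib/` and a copied `conf/`), rolling forward is `switch_release("hbase-current", "hbase-0.90.1")` and rolling back is the same call with the older directory. Because every release keeps its own `lib/`, jars from two versions never end up on the same classpath.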


    --
    Todd Lipcon
    Software Engineer, Cloudera
  • Ted Yu at Feb 13, 2011 at 4:45 pm
    I had 3 consecutive successful runs processing 200GB data for each run
    before hitting timeout problem in the 4th run.

    The 5th run couldn't proceed because master complained:

    2011-02-13 16:11:45,173 FATAL org.apache.hadoop.hbase.master.HMaster: Failed
    assignment of regions to
    serverName=sjc1-hadoop6.sjc1.carrieriq.com,60020,1297518996557,
    load=(requests=0, regions=231, usedHeap=3535, maxHeap=3983)

    but sjc1-hadoop6.sjc1 claimed:
    2011-02-13 16:13:32,258 DEBUG
    org.apache.hadoop.hbase.regionserver.HRegionServer: No master found, will
    retry

    Here is stack trace for sjc1-hadoop6.sjc1:
    http://pastebin.com/X8zWLXqu

    I didn't have a chance to capture the master stack trace, as the master
    exited after that.

    I also attach the master and region server logs from sjc1-hadoop6.sjc1 -
    pardon me for including individual email addresses, as attachments
    wouldn't go through hbase.apache.org
  • Ryan Rawson at Feb 13, 2011 at 8:24 pm
    Every handler thread, every reader, and also the accept thread are
    all blocked on flushing memstore. The handlers get blocked, then the
    readers, which have a finite handoff queue, get blocked, and then
    the accept thread.

    But why isn't memstore flushing? Do you have regionserver stats, i.e.
    how much global memstore RAM is used? That is found on the main page of
    the regionserver http service, also found in ganglia/file stats.

    I haven't looked at the logs yet, I'm off to lunch now.

    -ryan
  • Ted Yu at Feb 14, 2011 at 5:20 pm
    I disabled MSLAB.
    My flow still couldn't make much progress.

    In this region server stack trace, I don't see
    MemStoreFlusher.reclaimMemStoreMemory() call:
    http://pastebin.com/uiBRidUa

    On Sun, Feb 13, 2011 at 1:14 PM, Ted Yu wrote:

    I am using hadoop-core-0.20.2-322.jar downloaded from Ryan's repo.
    FYI

    On Sun, Feb 13, 2011 at 1:12 PM, Ted Yu wrote:

    Since the master server shut down, I restarted the cluster.
    The next flow over 200GB of data timed out.

    Here are some region server stats:

    request=0.0, regions=95, stores=213, storefiles=65,
    storefileIndexSize=99, memstoreSize=1311, compactionQueueSize=0,
    flushQueueSize=0, usedHeap=2532, maxHeap=3983, blockCacheSize=6853968,
    blockCacheFree=828520304, blockCacheCount=0, blockCacheHitCount=0,
    blockCacheMissCount=0, blockCacheEvictedCount=0, blockCacheHitRatio=0,
    blockCacheHitCachingRatio=0

    request=0.0, regions=95, stores=232, storefiles=72,
    storefileIndexSize=120, memstoreSize=301, compactionQueueSize=0,
    flushQueueSize=0, usedHeap=1740, maxHeap=3983, blockCacheSize=13110928,
    blockCacheFree=822263344, blockCacheCount=712, blockCacheHitCount=112478,
    blockCacheMissCount=712, blockCacheEvictedCount=0, blockCacheHitRatio=99,
    blockCacheHitCachingRatio=99

    Thanks
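
To relate stats like these to the question of global memstore pressure, one can compute the memstore share of the heap. The sketch below is an illustration only: the helper names are hypothetical, and it assumes that `memstoreSize` and `maxHeap` are reported in the same unit (megabytes). Whether the resulting fraction is high enough to block writes depends on the configured global memstore limits, which are not stated in this thread.

```python
def parse_metrics(line):
    """Parse a 'key=value, key=value' metrics line into a dict of floats."""
    metrics = {}
    for part in line.split(","):
        key, _, value = part.strip().partition("=")
        if key and value:
            metrics[key] = float(value)
    return metrics


def memstore_fraction(metrics):
    """Share of the maximum heap currently held by memstores."""
    return metrics["memstoreSize"] / metrics["maxHeap"]


if __name__ == "__main__":
    # The two samples quoted above, reduced to the fields used here.
    for sample in ("memstoreSize=1311, usedHeap=2532, maxHeap=3983",
                   "memstoreSize=301, usedHeap=1740, maxHeap=3983"):
        print(round(memstore_fraction(parse_metrics(sample)), 3))
```

For the two samples above this gives roughly 0.33 and 0.08 of the maximum heap.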

  • Todd Lipcon at Feb 14, 2011 at 6:00 pm

    On Mon, Feb 14, 2011 at 9:20 AM, Ted Yu wrote:

    I disabled MSLAB.
    My flow still couldn't make much progress.

    MSLAB is already disabled by default in 0.90.1. If it's not, I screwed up
    the commits and we should fix that.

    -Todd

    In this region server stack trace, I don't see
    MemStoreFlusher.reclaimMemStoreMemory() call:
    http://pastebin.com/uiBRidUa

    On Sun, Feb 13, 2011 at 1:14 PM, Ted Yu wrote:

    I am using hadoop-core-0.20.2-322.jar downloaded from Ryan's repo.
    FYI

    On Sun, Feb 13, 2011 at 1:12 PM, Ted Yu wrote:

    Since master server shut down, I restarted the cluster.
    The next flow over 200GB data got timed out.

    Here are some region server stat:

    request=0.0, regions=95, stores=213, storefiles=65,
    storefileIndexSize=99, memstoreSize=1311, compactionQueueSize=0,
    flushQueueSize=0, usedHeap=2532, maxHeap=3983, blockCacheSize=6853968,
    blockCacheFree=828520304, blockCacheCount=0, blockCacheHitCount=0,
    blockCacheMissCount=0, blockCacheEvictedCount=0, blockCacheHitRatio=0,
    blockCacheHitCachingRatio=0

    request=0.0, regions=95, stores=232, storefiles=72,
    storefileIndexSize=120, memstoreSize=301, compactionQueueSize=0,
    flushQueueSize=0, usedHeap=1740, maxHeap=3983, blockCacheSize=13110928,
    blockCacheFree=822263344, blockCacheCount=712,
    blockCacheHitCount=112478,
    blockCacheMissCount=712, blockCacheEvictedCount=0,
    blockCacheHitRatio=99,
    blockCacheHitCachingRatio=99

    Thanks


    On Sun, Feb 13, 2011 at 12:24 PM, Ryan Rawson <ryanobjc@gmail.com
    wrote:
    every handler thread, and every reader and also the accept thread are
    all blocked on flushing memstore. The handlers get blocked, then the
    readers also have a finite handoff queue and they are blocked and also
    the accept.

    But why isnt memstore flushing? Do you have regionserver stats? ie:
    how much memstore global ram used? That is found on the main page of
    the regionserver http service, also found in ganglia/file stats.

    I havent looked at the logs yet, I'm off to lunch now.

    -ryan
    On Sun, Feb 13, 2011 at 8:44 AM, Ted Yu wrote:
    I had 3 consecutive successful runs processing 200GB data for each
    run
    before hitting timeout problem in the 4th run.

    The 5th run couldn't proceed because master complained:

    2011-02-13 16:11:45,173 FATAL
    org.apache.hadoop.hbase.master.HMaster:
    Failed
    assignment of regions to
    serverName=sjc1-hadoop6.sjc1.carrieriq.com,60020,1297518996557,
    load=(requests=0, regions=231, usedHeap=3535, maxHeap=3983)

    but sjc1-hadoop6.sjc1 claimed:
    2011-02-13 16:13:32,258 DEBUG
    org.apache.hadoop.hbase.regionserver.HRegionServer: No master found, will
    retry

    Here is stack trace for sjc1-hadoop6.sjc1:
    http://pastebin.com/X8zWLXqu

    I didn't have chance to capture master stack trace as master exited after
    that.

    I also attach master and region server log on sjc1-hadoop6.sjc1 - pardon me
    for including individual email addresses as attachments wouldn't go through
    hbase.apache.org

    On Thu, Feb 10, 2011 at 5:05 PM, Todd Lipcon <todd@cloudera.com>
    wrote:
    On Thu, Feb 10, 2011 at 4:54 PM, Ted Yu wrote:

    Thanks for the explanation.
    Assuming the mixed class loading is static, why did this situation
    develop after 40 minutes of heavy load :-(

    You didn't hit global memstore pressure until 40 minutes of load.

    -Todd

    On Thu, Feb 10, 2011 at 4:42 PM, Ryan Rawson <ryanobjc@gmail.com>
    wrote:
    It's a standard linking issue: you get one class from one version and
    another class from another. They are mostly compatible in terms of
    signatures (hence no exceptions) but are subtly incompatible in
    different ways. In the stack trace you posted, the handlers were
    blocked in:

    at org.apache.hadoop.hbase.regionserver.MemStoreFlusher.reclaimMemStoreMemory(MemStoreFlusher.java:382)

    and the thread:

    "regionserver60020.cacheFlusher" daemon prio=10 tid=0x00002aaabc21e000
    nid=0x7717 waiting for monitor entry [0x0000000000000000]
    java.lang.Thread.State: BLOCKED (on object monitor)

    was idle.

    The cache flusher thread should be flushing, and yet it's doing
    nothing. This also happens to be one of the classes that were changed.
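
When a jstack dump is long, tabulating thread names against their reported states makes it quicker to spot a case like the one described here. The following is a rough sketch, not an HBase or JDK tool: it assumes the usual jstack layout (a quoted thread name on the header line, then a `java.lang.Thread.State:` line), and the function name is hypothetical.

```python
import re
from collections import Counter

# A thread header line starts with the quoted thread name.
THREAD_HEADER = re.compile(r'^"(?P<name>[^"]+)"')
THREAD_STATE = re.compile(r'java\.lang\.Thread\.State: (?P<state>[A-Z_]+)')


def summarize_thread_states(dump_text):
    """Map each thread name in a jstack-style dump to its reported state."""
    states = {}
    current = None
    for line in dump_text.splitlines():
        header = THREAD_HEADER.match(line.strip())
        if header:
            current = header.group("name")
            continue
        state = THREAD_STATE.search(line)
        if state and current is not None:
            states[current] = state.group("state")
            current = None
    return states


# The cacheFlusher entry quoted in this thread, used as sample input.
sample = '''"regionserver60020.cacheFlusher" daemon prio=10 tid=0x00002aaabc21e000 nid=0x7717 waiting for monitor entry [0x0000000000000000]
   java.lang.Thread.State: BLOCKED (on object monitor)
'''
states = summarize_thread_states(sample)
print(Counter(states.values()))
```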



    On Thu, Feb 10, 2011 at 4:34 PM, Ted Yu <yuzhihong@gmail.com>
    wrote:
    Can someone comment on my second question ?
    Thanks

    On Thu, Feb 10, 2011 at 4:25 PM, Ryan Rawson <ryanobjc@gmail.com>
    wrote:
    As I suspected.

    It's a byproduct of our maven assembly process. The process could be
    fixed. I wouldn't mind. I don't support runtime checking of jars;
    there is such a thing as too many tests, and this is an example of
    it. The check would then need a test, etc, etc.

    At SU we use new directories for each upgrade, copying the config
    over. With the lack of -default.xml this is easier than ever (just
    copy everything in conf/). With symlink switchover it makes roll
    forward/back as simple as doing a symlink switchover or back. I have
    to recommend this to everyone who doesn't have a management scheme.
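
The directory-per-version layout with a symlink switchover described above can be sketched roughly as follows. This is an illustration only, not the actual scripts used at SU: the link name `current`, the function name, and the directory names are all hypothetical, and it assumes a POSIX filesystem where renaming one symlink over another replaces it in a single step.

```python
import os
import tempfile


def switch_current(install_root, version_dir, link_name="current"):
    """Point install_root/link_name at version_dir, replacing any old link."""
    link_path = os.path.join(install_root, link_name)
    tmp_link = link_path + ".tmp"
    if os.path.lexists(tmp_link):
        os.remove(tmp_link)
    # Relative target, resolved against install_root when the link is followed.
    os.symlink(version_dir, tmp_link)
    # Rename the new link over the old one so readers never see a missing link.
    os.replace(tmp_link, link_path)
    return os.readlink(link_path)


# Demo in a scratch directory with one directory per (hypothetical) install.
root = tempfile.mkdtemp()
for version in ("hbase-0.90.0", "hbase-0.90.1"):
    os.mkdir(os.path.join(root, version))

first = switch_current(root, "hbase-0.90.0")
second = switch_current(root, "hbase-0.90.1")       # roll forward
rolled_back = switch_current(root, "hbase-0.90.0")  # roll back
```

Because each version lives in its own directory, an old jar cannot linger next to a new one; only the link changes.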
    On Thu, Feb 10, 2011 at 4:20 PM, Ted Yu <yuzhihong@gmail.com>
    wrote:
    hbase/hbase-0.90.1.jar leads lib/hbase-0.90.0.jar in the classpath.
    I wonder
    1. why the hbase jar is placed in two directories - 0.20.6 didn't use
    such a structure
    2. what from lib/hbase-0.90.0.jar could be picked up and why there
    wasn't an exception in the server log

    I think a JIRA should be filed for item 2 above - bail out when the
    two hbase jars from $HBASE_HOME and $HBASE_HOME/lib are of different
    versions.
    Cheers
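
The kind of check Ted suggests could also be run outside HBase, for example before starting the daemons. The sketch below is not something HBase itself does: the function names are hypothetical, and it assumes the main jars follow the `hbase-<version>.jar` naming seen in this thread (so `hbase-0.90.0-tests.jar` is ignored).

```python
import os
import re
import tempfile

# Matches hbase-0.90.1.jar but not hbase-0.90.0-tests.jar.
JAR_PATTERN = re.compile(r"^hbase-(\d+(?:\.\d+)*)\.jar$")


def hbase_jar_versions(hbase_home):
    """Return {version: [jar paths]} for hbase jars in hbase_home and lib/."""
    versions = {}
    for directory in (hbase_home, os.path.join(hbase_home, "lib")):
        if not os.path.isdir(directory):
            continue
        for name in sorted(os.listdir(directory)):
            match = JAR_PATTERN.match(name)
            if match:
                path = os.path.join(directory, name)
                versions.setdefault(match.group(1), []).append(path)
    return versions


def check_single_version(hbase_home):
    """Raise if more than one hbase jar version is present."""
    versions = hbase_jar_versions(hbase_home)
    if len(versions) > 1:
        raise RuntimeError(f"mixed hbase jar versions: {sorted(versions)}")
    return versions


# Demo: recreate the layout from this thread in a scratch directory.
home = tempfile.mkdtemp()
os.mkdir(os.path.join(home, "lib"))
for relative in ("hbase-0.90.1.jar", "hbase-0.90.0-tests.jar",
                 os.path.join("lib", "hbase-0.90.0.jar")):
    open(os.path.join(home, relative), "w").close()

found = sorted(hbase_jar_versions(home))
try:
    check_single_version(home)
    mixed = False
except RuntimeError:
    mixed = True
```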

    On Thu, Feb 10, 2011 at 3:40 PM, Ryan Rawson <ryanobjc@gmail.com>
    wrote:
    What do you get when you:

    ls lib/hbase*

    I'm going to guess there is hbase-0.90.0.jar there



    On Thu, Feb 10, 2011 at 3:25 PM, Ted Yu <yuzhihong@gmail.com>
    wrote:
    hbase-0.90.0-tests.jar and hbase-0.90.1.jar co-exist.
    Would this be a problem?

    On Thu, Feb 10, 2011 at 3:16 PM, Ryan Rawson <ryanobjc@gmail.com>
    wrote:
    You don't have both the old and the new hbase jars in there, do you?
    -ryan

    On Thu, Feb 10, 2011 at 3:12 PM, Ted Yu <yuzhihong@gmail.com>
    wrote:
    .META. went offline during second flow attempt.

    The time out I mentioned happened for 1st and 3rd attempts. HBase was
    restarted before the 1st and 3rd attempts.

    Here is jstack:
    http://pastebin.com/EHMSvsRt

    On Thu, Feb 10, 2011 at 3:04 PM, Stack <stack@duboce.net>
    wrote:
    So, .META. is not online? What happens if you use the shell at this
    time?
    Your attachment did not come across, Ted. Mind pastebin'ing it?
    St.Ack

    On Thu, Feb 10, 2011 at 2:41 PM, Ted Yu <yuzhihong@gmail.com>
    wrote:
    I replaced hbase jar with hbase-0.90.1.jar
    I also upgraded client side jar to hbase-0.90.1.jar

    Our map tasks were running faster than before for about 50 minutes.
    However, map tasks then timed out calling flushCommits(). This happened
    even after fresh restart of hbase.

    I don't see any exception in region server logs.

    In master log, I found:

    2011-02-10 18:24:15,286 DEBUG
    org.apache.hadoop.hbase.master.handler.OpenedRegionHandler: Opened region
    -ROOT-,,0.70236052 on sjc1-hadoop6.X.com,60020,1297362251595
    2011-02-10 18:24:15,349 INFO
    org.apache.hadoop.hbase.catalog.CatalogTracker: Failed verification of
    .META.,,1 at address=null;
    org.apache.hadoop.hbase.NotServingRegionException:
    org.apache.hadoop.hbase.NotServingRegionException: Region is not online:
    .META.,,1
    2011-02-10 18:24:15,350 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign:
    master:60000-0x12e10d0e31e0000 Creating (or updating) unassigned node for
    1028785192 with OFFLINE state

    I am attaching region server (which didn't respond to stop-hbase.sh)
    jstack.
    FYI

    On Thu, Feb 10, 2011 at 10:10 AM, Stack <stack@duboce.net>
    wrote:
    That's probably enough Ted. The 0.90.1 hbase-default.xml has an extra
    config. to enable the experimental HBASE-3455 feature but you can copy
    that over if you want to try playing with it (it defaults off so you'd
    copy over the config. if you wanted to set it to true).
    St.Ack


    --
    Todd Lipcon
    Software Engineer, Cloudera


  • Ted Yu at Feb 13, 2011 at 5:00 pm
    BTW
    The timeout (when calling flushCommits) happened at midnight, so I
    didn't capture a jstack.

    In hadoop1 region server log, I see this around time of timeout in 4th run:

    2011-02-13 08:25:01,015 DEBUG org.apache.hadoop.hbase.regionserver.HRegion:
    Finished snapshotting, commencing flushing stores
    2011-02-13 08:25:01,016 WARN org.apache.hadoop.ipc.HBaseServer: IPC Server
    Responder, call flushRegion(REGION => {NAME =>
    'NIGHTLYDEVGRIDSGRIDSQL-THREEGPPSPEECHCALLS-1297583809865,2>&U\xF6\xB582>&U\xF6\xB582>&U\xF6\xB582>&U\xF6\xB582>&T,1297583814638.8cb772d452dee232306dfab0b472ec9a.',
    STARTKEY => '2>&U\xF6\xB582>&U\xF6\xB582>&U\xF6\xB582>&U\xF6\xB582>&T',
    ENDKEY =>
    '2\xC1\xA3\xDFhVz2\xC1\xA3\xDFhVz2\xC1\xA3\xDFhVz2\xC1\xA3\xDFhVz2\xC1\xA3\xDD',
    ENCODED => 8cb772d452dee232306dfab0b472ec9a, TABLE => {{NAME =>
    'NIGHTLYDEVGRIDSGRIDSQL-THREEGPPSPEECHCALLS-1297583809865', FAMILIES =>
    [{NAME => 'd', BLOOMFILTER => 'ROW', REPLICATION_SCOPE => '0', VERSIONS =>
    '2', COMPRESSION => 'GZ', TTL => '31536000', BLOCKSIZE => '65536', IN_MEMORY
    => 'false', BLOCKCACHE => 'false'}, {NAME => 'i', BLOOMFILTER => 'ROW',
    REPLICATION_SCOPE => '0', VERSIONS => '2', COMPRESSION => 'GZ', TTL =>
    '31536000', BLOCKSIZE => '65536', IN_MEMORY => 'false', BLOCKCACHE =>
    'false'}, {NAME => 'v', BLOOMFILTER => 'ROW', REPLICATION_SCOPE => '0',
    VERSIONS => '2', COMPRESSION => 'GZ', TTL => '31536000', BLOCKSIZE =>
    '65536', IN_MEMORY => 'false', BLOCKCACHE => 'false'}]}}) from
    10.202.50.76:62489: output error
    2011-02-13 08:25:01,020 WARN org.apache.hadoop.ipc.HBaseServer: PRI IPC
    Server handler 3 on 60020 caught: java.nio.channels.ClosedChannelException
    at sun.nio.ch.SocketChannelImpl.ensureWriteOpen(SocketChannelImpl.java:133)
    at sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:324)
    at org.apache.hadoop.hbase.ipc.HBaseServer.channelWrite(HBaseServer.java:1339)
    at org.apache.hadoop.hbase.ipc.HBaseServer$Responder.processResponse(HBaseServer.java:727)
    at org.apache.hadoop.hbase.ipc.HBaseServer$Responder.doRespond(HBaseServer.java:792)
    at org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(HBaseServer.java:1083)
  • Ted Yu at Feb 13, 2011 at 5:10 pm
    Here is partial config I used:
    http://pastebin.com/1Dpbb2LA

    I verified that there is no hbase-0.90.1.jar in lib dir.

    Thanks

Discussion Overview
group: dev @
categories: hbase, hadoop
posted: Feb 10, '11 at 10:42p
active: Feb 14, '11 at 6:00p
posts: 19
users: 4
website: hbase.apache.org
