Here is the stack trace for one region server which didn't shut down cleanly:
http://pastebin.com/PEtEdi4g

I noticed IPC Server handler 7 was holding a lock in reclaimMemStoreMemory().

Here is a related snippet from the region server's log:
http://pastebin.com/KeXppURX

I noticed region 1297662719973.cd06c5f9e4a0ffcc16cb6a5e559cd5b3 was
splitting.

Please advise whether there could be any relation between the above snippet
and the lockup in MemStoreFlusher.

Thanks
On Mon, Feb 14, 2011 at 9:20 AM, Ted Yu wrote:

I disabled MSLAB.
My flow still couldn't make much progress.

In this region server stack trace, I don't see a
MemStoreFlusher.reclaimMemStoreMemory() call:
http://pastebin.com/uiBRidUa

On Sun, Feb 13, 2011 at 1:14 PM, Ted Yu wrote:

I am using hadoop-core-0.20.2-322.jar downloaded from Ryan's repo.
FYI

On Sun, Feb 13, 2011 at 1:12 PM, Ted Yu wrote:

Since the master server had shut down, I restarted the cluster.
The next flow over 200GB of data timed out.

Here are some region server stats:

request=0.0, regions=95, stores=213, storefiles=65,
storefileIndexSize=99, memstoreSize=1311, compactionQueueSize=0,
flushQueueSize=0, usedHeap=2532, maxHeap=3983, blockCacheSize=6853968,
blockCacheFree=828520304, blockCacheCount=0, blockCacheHitCount=0,
blockCacheMissCount=0, blockCacheEvictedCount=0, blockCacheHitRatio=0,
blockCacheHitCachingRatio=0

request=0.0, regions=95, stores=232, storefiles=72,
storefileIndexSize=120, memstoreSize=301, compactionQueueSize=0,
flushQueueSize=0, usedHeap=1740, maxHeap=3983, blockCacheSize=13110928,
blockCacheFree=822263344, blockCacheCount=712, blockCacheHitCount=112478,
blockCacheMissCount=712, blockCacheEvictedCount=0, blockCacheHitRatio=99,
blockCacheHitCachingRatio=99

Thanks

On Sun, Feb 13, 2011 at 12:24 PM, Ryan Rawson wrote:

Every handler thread, every reader, and the accept thread are all
blocked on flushing the memstore. The handlers get blocked first; the
readers have a finite handoff queue, so they block too, and then so
does the accept thread.

But why isn't the memstore flushing? Do you have regionserver stats,
i.e. how much global memstore RAM is used? That is found on the main
page of the regionserver HTTP service, and also in ganglia/file stats.

I haven't looked at the logs yet; I'm off to lunch now.

-ryan
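
A minimal sketch in Java of the pile-up Ryan describes, assuming a simplified
flusher (hypothetical names, not HBase's actual code): every handler calls a
synchronized reclaim method that waits for the global memstore size to fall
below a limit, and only the flusher thread can make it fall, so a stalled
flusher leaves every handler parked (and, in the real server, the readers and
accept thread back up behind them).

import java.util.concurrent.atomic.AtomicLong;

// Hypothetical names; only the blocking pattern is the point.
public class FlushBacklogSketch {
    private final AtomicLong globalMemstoreSize = new AtomicLong();
    private final long upperLimit = 100;

    // Analogue of MemStoreFlusher.reclaimMemStoreMemory(): handlers park
    // here until the global memstore size drops below the limit.
    public synchronized void reclaimMemStoreMemory() throws InterruptedException {
        while (globalMemstoreSize.get() > upperLimit) {
            wait(1000);
        }
    }

    // Called by the single flusher thread after it frees memory.
    public synchronized void flushed(long bytesFreed) {
        globalMemstoreSize.addAndGet(-bytesFreed);
        notifyAll();
    }

    public static void main(String[] args) throws Exception {
        FlushBacklogSketch rs = new FlushBacklogSketch();
        rs.globalMemstoreSize.set(500);          // already over the limit

        for (int i = 0; i < 5; i++) {            // five "IPC handlers"
            final int id = i;
            new Thread(() -> {
                try {
                    rs.reclaimMemStoreMemory();
                    System.out.println("handler " + id + " unblocked");
                } catch (InterruptedException ignored) { }
            }, "handler-" + id).start();
        }

        // Until the flusher frees memory, all five handlers stay parked.
        Thread.sleep(2000);
        rs.flushed(500);
    }
}
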
On Sun, Feb 13, 2011 at 8:44 AM, Ted Yu wrote:
I had 3 consecutive successful runs processing 200GB of data per run
before hitting the timeout problem in the 4th run.

The 5th run couldn't proceed because the master complained:

2011-02-13 16:11:45,173 FATAL org.apache.hadoop.hbase.master.HMaster: Failed
assignment of regions to
serverName=sjc1-hadoop6.sjc1.carrieriq.com,60020,1297518996557,
load=(requests=0, regions=231, usedHeap=3535, maxHeap=3983)

but sjc1-hadoop6.sjc1 claimed:
2011-02-13 16:13:32,258 DEBUG
org.apache.hadoop.hbase.regionserver.HRegionServer: No master found, will
retry

Here is stack trace for sjc1-hadoop6.sjc1:
http://pastebin.com/X8zWLXqu

I didn't have a chance to capture the master stack trace, as the master
exited after that.

I also attached the master and region server logs from sjc1-hadoop6.sjc1 -
pardon me for including individual email addresses, as attachments wouldn't
go through hbase.apache.org.

On Thu, Feb 10, 2011 at 5:05 PM, Todd Lipcon <todd@cloudera.com>
wrote:
On Thu, Feb 10, 2011 at 4:54 PM, Ted Yu <yuzhihong@gmail.com>
wrote:
Thanks for the explanation.
Assuming the mixed class loading is static, why did this situation
develop after 40 minutes of heavy load :-(
You didn't hit global memstore pressure until 40 minutes of load.

-Todd

On Thu, Feb 10, 2011 at 4:42 PM, Ryan Rawson <ryanobjc@gmail.com>
wrote:
It's a standard linking issue: you get one class from one version and
another from another; they are mostly compatible in terms of
signatures (hence no exceptions) but are subtly incompatible in
different ways. In the stack trace you posted, the handlers were
blocked in:

at
org.apache.hadoop.hbase.regionserver.MemStoreFlusher.reclaimMemStoreMemory(MemStoreFlusher.java:382)
and the thread:

"regionserver60020.cacheFlusher" daemon prio=10
tid=0x00002aaabc21e000
nid=0x7717 waiting for monitor entry [0x0000000000000000]
java.lang.Thread.State: BLOCKED (on object monitor)

was idle.

The cache flusher thread should be flushing, and yet it's doing
nothing. This also happens to be one of the classes that were
changed.



On Thu, Feb 10, 2011 at 4:34 PM, Ted Yu <yuzhihong@gmail.com>
wrote:
Can someone comment on my second question?
Thanks

On Thu, Feb 10, 2011 at 4:25 PM, Ryan Rawson <
ryanobjc@gmail.com>
wrote:
As I suspected.

It's a byproduct of our maven assembly process. The process could be
fixed. I wouldn't mind. I don't support runtime checking of jars;
there is such a thing as too many tests, and this is an example of it.
The check would then need a test, etc, etc.

At SU we use new directories for each upgrade, copying the config
over. With the lack of -default.xml this is easier than ever (just
copy everything in conf/). With symlink switchover it makes roll
forward/back as simple as doing a symlink switchover or back. I have
to recommend this to everyone who doesn't have a management scheme.
On Thu, Feb 10, 2011 at 4:20 PM, Ted Yu <yuzhihong@gmail.com> wrote:
hbase/hbase-0.90.1.jar precedes lib/hbase-0.90.0.jar in the classpath.
I wonder
1. why the hbase jar is placed in two directories - 0.20.6 didn't use
such a structure
2. what from lib/hbase-0.90.0.jar could be picked up, and why there
wasn't an exception in the server log

I think a JIRA should be filed for item 2 above - bail out when the
two hbase jars from $HBASE_HOME and $HBASE_HOME/lib are of different
versions.
Cheers
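
Regarding item 2, a rough sketch of the kind of startup check such a JIRA
could ask for (hypothetical code, not an existing HBase API): collect the
versions of every hbase-*.jar found under $HBASE_HOME and $HBASE_HOME/lib
and refuse to start if they disagree, instead of silently linking classes
from both jars.

import java.io.File;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class HBaseJarVersionCheck {
    // Matches names like hbase-0.90.1.jar or hbase-0.90.0-tests.jar.
    private static final Pattern HBASE_JAR =
        Pattern.compile("hbase-(\\d+\\.\\d+\\.\\d+)(-tests)?\\.jar");

    static List<String> versionsIn(File dir) {
        List<String> versions = new ArrayList<>();
        File[] files = dir.listFiles();
        if (files == null) return versions;
        for (File f : files) {
            Matcher m = HBASE_JAR.matcher(f.getName());
            if (m.matches()) versions.add(m.group(1));
        }
        return versions;
    }

    public static void main(String[] args) {
        File home = new File(System.getenv().getOrDefault("HBASE_HOME", "."));
        List<String> versions = versionsIn(home);
        versions.addAll(versionsIn(new File(home, "lib")));

        if (versions.isEmpty()) {
            System.err.println("No hbase jars found under " + home);
            return;
        }
        String first = versions.get(0);
        for (String v : versions) {
            if (!v.equals(first)) {
                // The "bail out" Ted asks for: refuse to start with mixed
                // jar versions rather than linking classes from both.
                throw new IllegalStateException(
                    "Mixed hbase jar versions on classpath: " + versions);
            }
        }
        System.out.println("hbase jar versions consistent: " + first);
    }
}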

On Thu, Feb 10, 2011 at 3:40 PM, Ryan Rawson <
ryanobjc@gmail.com>
wrote:
What do you get when you:

ls lib/hbase*

I'm going to guess there is hbase-0.90.0.jar there



On Thu, Feb 10, 2011 at 3:25 PM, Ted Yu <
yuzhihong@gmail.com>
wrote:
hbase-0.90.0-tests.jar and hbase-0.90.1.jar co-exist
Would this be a problem?

On Thu, Feb 10, 2011 at 3:16 PM, Ryan Rawson
<ryanobjc@gmail.com
wrote:
You don't have both the old and the new hbase jars in there, do you?
-ryan

On Thu, Feb 10, 2011 at 3:12 PM, Ted Yu <
yuzhihong@gmail.com>
wrote:
.META. went offline during the second flow attempt.

The timeout I mentioned happened for the 1st and 3rd attempts. HBase
was restarted before the 1st and 3rd attempts.

Here is jstack:
http://pastebin.com/EHMSvsRt

On Thu, Feb 10, 2011 at 3:04 PM, Stack <
stack@duboce.net>
wrote:
So, .META. is not online? What happens if you use the shell at this time?
Your attachment did not come across, Ted. Mind pastebin'ing it?
St.Ack

On Thu, Feb 10, 2011 at 2:41 PM, Ted Yu
<yuzhihong@gmail.com
wrote:
I replaced the hbase jar with hbase-0.90.1.jar.
I also upgraded the client side jar to hbase-0.90.1.jar.
Our map tasks were running faster than before for about 50 minutes.
However, map tasks then timed out calling flushCommits(). This
happened even after a fresh restart of hbase.

I don't see any exceptions in the region server logs.

In the master log, I found:

2011-02-10 18:24:15,286 DEBUG org.apache.hadoop.hbase.master.handler.OpenedRegionHandler:
Opened region -ROOT-,,0.70236052 on sjc1-hadoop6.X.com,60020,1297362251595
2011-02-10 18:24:15,349 INFO org.apache.hadoop.hbase.catalog.CatalogTracker:
Failed verification of .META.,,1 at address=null;
org.apache.hadoop.hbase.NotServingRegionException:
org.apache.hadoop.hbase.NotServingRegionException: Region is not online: .META.,,1
2011-02-10 18:24:15,350 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign:
master:60000-0x12e10d0e31e0000 Creating (or updating) unassigned node for
1028785192 with OFFLINE state

I am attaching the jstack of the region server (which didn't respond to
stop-hbase.sh).
FYI

On Thu, Feb 10, 2011 at 10:10 AM, Stack
<stack@duboce.net>
wrote:
That's probably enough, Ted. The 0.90.1 hbase-default.xml has an extra
config. to enable the experimental HBASE-3455 feature, but you can copy
that over if you want to try playing with it (it defaults to off, so
you'd copy over the config. if you wanted to set it to true).
St.Ack


--
Todd Lipcon
Software Engineer, Cloudera


  • Todd Lipcon at Feb 14, 2011 at 7:31 pm
    Good find, Ted. Looking into this...
  • Todd Lipcon at Feb 14, 2011 at 9:00 pm
    Hi Ted,

    What was the reason for the region server shutdown? I.e., was it aborting
    itself, or did you send it a kill signal, or what?

    Still trying to understand why this happened.

    -Todd
  • Ted Yu at Feb 14, 2011 at 9:08 pm
    The lockup happened before I shut down the region server.
    I had to get out of that situation so that I could continue 0.90.1 validation.
  • Todd Lipcon at Feb 14, 2011 at 9:37 pm
    Gotcha, OK. I think I understand what's going on then.
    I'm guessing you have a TON of log messages like this?

    2011-02-14 06:27:23,116 DEBUG org.apache.hadoop.hbase.regionserver.HRegion:
    NOT flushing memstore for region
    NIGHTLYDEVGRIDSGRIDSQL-THREEGPPSPEECHCALLS-1297661454931,E\xC2\xC8\xBC\xD4\xA5\x04E\xC2\xC8\xBC\xD4\xA5\x04E\xC2\xC8\xBC\xD4\xA5\x04E\xC2\xC8\xBC\xD4\xA5\x04E\xC2\xC8\xAA,1297662719973.cd06c5f9e4a0ffcc16cb6a5e559cd5b3.,
    flushing=false, writesEnabled=false


    -Todd
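
    A simplified sketch of the scenario that log line suggests (hypothetical
    names, not the real HRegion/MemStoreFlusher code): the flush loop keeps
    selecting the region with the biggest memstore, but that region is
    mid-split, so writesEnabled=false and the flush is skipped. Global
    memstore usage never drops, and callers blocked in reclaimMemStoreMemory()
    never wake up.

    import java.util.Arrays;
    import java.util.List;

    // Hypothetical names; only the skipped-flush pattern is the point.
    public class SkippedFlushSketch {
        static class Region {
            final String name;
            long memstoreSize;
            boolean writesEnabled = true;    // false while the region is splitting
            Region(String name, long size) { this.name = name; this.memstoreSize = size; }

            // Analogue of the flush call that produced the log line above.
            long flush() {
                if (!writesEnabled) {
                    System.out.println("NOT flushing memstore for region " + name
                        + ", writesEnabled=false");
                    return 0;                // nothing freed
                }
                long freed = memstoreSize;
                memstoreSize = 0;
                return freed;
            }
        }

        public static void main(String[] args) {
            Region splitting = new Region("region-under-split", 900);
            splitting.writesEnabled = false;
            Region normal = new Region("normal-region", 100);
            List<Region> regions = Arrays.asList(splitting, normal);

            long globalMemstore = 1000, lowWaterMark = 400;
            int attempts = 0;
            // The flusher keeps picking the region with the largest memstore,
            // which is exactly the one it is not allowed to flush.
            while (globalMemstore > lowWaterMark && attempts++ < 5) {
                Region biggest = regions.get(0);
                for (Region r : regions) {
                    if (r.memstoreSize > biggest.memstoreSize) biggest = r;
                }
                globalMemstore -= biggest.flush();   // frees nothing for the split region
            }
            // Still over the low-water mark, so anyone waiting on reclaim stays blocked.
            System.out.println("global memstore still at " + globalMemstore);
        }
    }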
  • Ted Yu at Feb 14, 2011 at 9:49 pm
    You're right.
    On Mon, Feb 14, 2011 at 1:12 PM, Todd Lipcon wrote:

    Gotcha, OK. I think I understand what's going on then.
    I'm guessing you have a TON of log messages like this?

    2011-02-14 06:27:23,116 DEBUG org.apache.hadoop.hbase.regionserver.HRegion:
    NOT flushing memstore for region

    NIGHTLYDEVGRIDSGRIDSQL-THREEGPPSPEECHCALLS-1297661454931,E\xC2\xC8\xBC\xD4\xA5\x04E\xC2\xC8\xBC\xD4\xA5\x04E\xC2\xC8\xBC\xD4\xA5\x04E\xC2\xC8\xBC\xD4\xA5\x04E\xC2\xC8\xAA,1297662719973.cd06c5f9e4a0ffcc16cb6a5e559cd5b3.,
    flushing=false, writesEnabled=false


    -Todd
    On Mon, Feb 14, 2011 at 1:08 PM, Ted Yu wrote:

    The lock up happened before I shut down region server.
    I had to get out of that situation so that I can continue 0.90.1
    validation.
    On Mon, Feb 14, 2011 at 12:59 PM, Todd Lipcon wrote:

    Hi Ted,

    What was the reason for the region server shutdown? ie was it aborting
    itself, or you send it a kill signal, or what?

    Still trying to understand why this happened.

    -Todd
    On Mon, Feb 14, 2011 at 10:40 AM, Ted Yu wrote:

    Here is stack trace for one region server which didn't cleanly
    shutdown:
    http://pastebin.com/PEtEdi4g

    I noticed IPC Server handler 7 was holding lock in
    reclaimMemStoreMemory().
    Here is related snippet from the region server's log:
    http://pastebin.com/KeXppURX

    I noticed region 1297662719973.cd06c5f9e4a0ffcc16cb6a5e559cd5b3 was
    splitting.

    Please advise whether there could be any relation between the above snippet
    and lock up in MemStoreFlusher.

    Thanks
    On Mon, Feb 14, 2011 at 9:20 AM, Ted Yu wrote:

    I disabled MSLAB.
    My flow still couldn't make much progress.

    In this region server stack trace, I don't see
    MemStoreFlusher.reclaimMemStoreMemory() call:
    http://pastebin.com/uiBRidUa


    On Sun, Feb 13, 2011 at 1:14 PM, Ted Yu <yuzhihong@gmail.com>
    wrote:
    I am using hadoop-core-0.20.2-322.jar downloaded from Ryan's
    repo.
    FYI


    On Sun, Feb 13, 2011 at 1:12 PM, Ted Yu <yuzhihong@gmail.com>
    wrote:
    Since master server shut down, I restarted the cluster.
    The next flow over 200GB data got timed out.

    Here are some region server stat:

    request=0.0, regions=95, stores=213, storefiles=65,
    storefileIndexSize=99, memstoreSize=1311, compactionQueueSize=0,
    flushQueueSize=0, usedHeap=2532, maxHeap=3983,
    blockCacheSize=6853968,
    blockCacheFree=828520304, blockCacheCount=0,
    blockCacheHitCount=0,
    blockCacheMissCount=0, blockCacheEvictedCount=0,
    blockCacheHitRatio=0,
    blockCacheHitCachingRatio=0

    request=0.0, regions=95, stores=232, storefiles=72,
    storefileIndexSize=120, memstoreSize=301, compactionQueueSize=0,
    flushQueueSize=0, usedHeap=1740, maxHeap=3983,
    blockCacheSize=13110928,
    blockCacheFree=822263344, blockCacheCount=712,
    blockCacheHitCount=112478,
    blockCacheMissCount=712, blockCacheEvictedCount=0,
    blockCacheHitRatio=99,
    blockCacheHitCachingRatio=99

    Thanks


    On Sun, Feb 13, 2011 at 12:24 PM, Ryan Rawson <
    ryanobjc@gmail.com
    wrote:
    every handler thread, and every reader and also the accept
    thread
    are
    all blocked on flushing memstore. The handlers get blocked,
    then
    the
    readers also have a finite handoff queue and they are blocked
    and
    also
    the accept.

    But why isnt memstore flushing? Do you have regionserver
    stats?
    ie:
    how much memstore global ram used? That is found on the main
    page
    of
    the regionserver http service, also found in ganglia/file
    stats.
    I havent looked at the logs yet, I'm off to lunch now.

    -ryan

    On Sun, Feb 13, 2011 at 8:44 AM, Ted Yu <yuzhihong@gmail.com>
    wrote:
    I had 3 consecutive successful runs processing 200GB data for
    each
    run
    before hitting timeout problem in the 4th run.

    The 5th run couldn't proceed because master complained:

    2011-02-13 16:11:45,173 FATAL
    org.apache.hadoop.hbase.master.HMaster:
    Failed
    assignment of regions to
    serverName=sjc1-hadoop6.sjc1.carrieriq.com
    ,60020,1297518996557,
    load=(requests=0, regions=231, usedHeap=3535, maxHeap=3983)

    but sjc1-hadoop6.sjc1 claimed:
    2011-02-13 16:13:32,258 DEBUG
    org.apache.hadoop.hbase.regionserver.HRegionServer: No master
    found,
    will
    retry

    Here is stack trace for sjc1-hadoop6.sjc1:
    http://pastebin.com/X8zWLXqu

    I didn't have chance to capture master stack trace as master
    exited
    after
    that.

    I also attach master and region server log on
    sjc1-hadoop6.sjc1
    -
    pardon me
    for including individual email addresses as attachments
    wouldn't
    go
    through
    hbase.apache.org

    On Thu, Feb 10, 2011 at 5:05 PM, Todd Lipcon <
    todd@cloudera.com>
    wrote:
    On Thu, Feb 10, 2011 at 4:54 PM, Ted Yu <
    yuzhihong@gmail.com>
    wrote:
    Thanks for the explanation.
    Assuming the mixed class loading is static, why did this
    situation
    develop
    after 40 minutes of heavy load :-(
    You didn't hit global memstore pressure until 40 minutes of
    load.
    -Todd

    On Thu, Feb 10, 2011 at 4:42 PM, Ryan Rawson <
    ryanobjc@gmail.com>
    wrote:
    It's a standard linking issue, you get one class from
    one
    version
    another from another, they are mostly compatible in
    terms
    of
    signatures (hence no exceptions) but are subtly
    incompatible
    in
    different ways. In the stack trace you posted, the
    handlers
    were
    blocked in:

    at
    org.apache.hadoop.hbase.regionserver.MemStoreFlusher.reclaimMemStoreMemory(MemStoreFlusher.java:382)
    and the thread:

    "regionserver60020.cacheFlusher" daemon prio=10
    tid=0x00002aaabc21e000
    nid=0x7717 waiting for monitor entry
    [0x0000000000000000]
    java.lang.Thread.State: BLOCKED (on object monitor)

    was idle.

    The cache flusher thread should be flushing, and yet
    it's
    doing
    nothing. This also happens to be one of the classes
    that
    were
    changed.



    On Thu, Feb 10, 2011 at 4:34 PM, Ted Yu <
    yuzhihong@gmail.com>
    wrote:
    Can someone comment on my second question ?
    Thanks

    On Thu, Feb 10, 2011 at 4:25 PM, Ryan Rawson <
    ryanobjc@gmail.com>
    wrote:
    As I suspected.

    It's a byproduct of our maven assembly process. The
    process
    could
    be
    fixed. I wouldn't mind. I don't support runtime
    checking
    of
    jars,
    there is such thing as too much tests, and this is an
    example
    of
    it.
    The check would then need a test, etc, etc.

    At SU we use new directories for each upgrade,
    copying
    the
    config
    over. With the lack of -default.xml this is easier
    than
    ever
    (just
    copy everything in conf/). With symlink switchover
    it
    makes
    roll
    forward/back as simple as doing a symlink switchover
    or
    back.
    I
    have
    to recommend this to everyone who doesnt have a
    management
    scheme.
    On Thu, Feb 10, 2011 at 4:20 PM, Ted Yu <
    yuzhihong@gmail.com
    wrote:
    hbase/hbase-0.90.1.jar leads lib/hbase-0.90.0.jar
    in
    the
    classpath.
    I wonder
    1. why hbase jar is placed in two directories -
    0.20.6
    didn't use
    such
    structure
    2. what from lib/hbase-0.90.0.jar could be picked
    up
    and
    why
    there
    wasn't
    exception in server log

    I think a JIRA should be filed for item 2 above -
    bail
    out
    when
    the
    two
    hbase jars from $HBASE_HOME and $HBASE_HOME/lib are
    of
    different
    versions.
    Cheers

    On Thu, Feb 10, 2011 at 3:40 PM, Ryan Rawson <
    ryanobjc@gmail.com>
    wrote:
    What do you get when you:

    ls lib/hbase*

    I'm going to guess there is hbase-0.90.0.jar there



    On Thu, Feb 10, 2011 at 3:25 PM, Ted Yu <
    yuzhihong@gmail.com>
    wrote:
    hbase-0.90.0-tests.jar and hbase-0.90.1.jar
    co-exist
    Would this be a problem ?

    On Thu, Feb 10, 2011 at 3:16 PM, Ryan Rawson
    <ryanobjc@gmail.com
    wrote:
    You don't have both the old and the new hbase
    jars
    in
    there
    do
    you?
    -ryan

    On Thu, Feb 10, 2011 at 3:12 PM, Ted Yu <
    yuzhihong@gmail.com>
    wrote:
    .META. went offline during second flow
    attempt.
    The time out I mentioned happened for 1st and
    3rd
    attempts.
    HBase
    was
    restarted before the 1st and 3rd attempts.

    Here is jstack:
    http://pastebin.com/EHMSvsRt

    On Thu, Feb 10, 2011 at 3:04 PM, Stack <
    stack@duboce.net>
    wrote:
    So, .META. is not online? What happens if
    you
    use
    shell
    at
    this
    time.
    Your attachement did not come across Ted.
    Mind
    postbin'ing
    it?
    St.Ack

    On Thu, Feb 10, 2011 at 2:41 PM, Ted Yu <yuzhihong@gmail.com> wrote:
    I replaced the hbase jar with hbase-0.90.1.jar.
    I also upgraded the client-side jar to hbase-0.90.1.jar.
    Our map tasks were running faster than before for about 50 minutes. However,
    map tasks then timed out calling flushCommits(). This happened even after a
    fresh restart of hbase.

    I don't see any exception in the region server logs.
    In the master log, I found:

    2011-02-10 18:24:15,286 DEBUG org.apache.hadoop.hbase.master.handler.OpenedRegionHandler:
    Opened region -ROOT-,,0.70236052 on sjc1-hadoop6.X.com,60020,1297362251595
    2011-02-10 18:24:15,349 INFO org.apache.hadoop.hbase.catalog.CatalogTracker:
    Failed verification of .META.,,1 at address=null;
    org.apache.hadoop.hbase.NotServingRegionException:
    org.apache.hadoop.hbase.NotServingRegionException: Region is not online: .META.,,1
    2011-02-10 18:24:15,350 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign:
    master:60000-0x12e10d0e31e0000 Creating (or updating) unassigned node for
    1028785192 with OFFLINE state

    I am attaching the jstack of the region server (which didn't respond to
    stop-hbase.sh).
    FYI

    On Thu, Feb 10, 2011 at 10:10 AM, Stack <stack@duboce.net> wrote:
    That's probably enough, Ted. The 0.90.1 hbase-default.xml has an extra config
    to enable the experimental HBASE-3455 feature, but you can copy that over if
    you want to try playing with it (it defaults to off, so you'd copy over the
    config if you wanted to set it to true).
    St.Ack


  • Ted Yu at Feb 15, 2011 at 1:18 am
    I applied Todd's patch from HBASE-3531 locally and built the hbase jar.

    I restarted the cluster. The first flow over 200GB of data succeeded.
    The second flow got stuck, so I tried to shut down the cluster.

    Here is the stack trace for one of the region servers that refused to go down:
    http://pastebin.com/yJQhhYp8

    This is particularly interesting (other threads were waiting to lock
    0x000000070be8a028):

    "IPC Server handler 5 on 60020" daemon prio=10 tid=0x00002aaab883c000
    nid=0x32bb waiting on condition [0x0000000043f46000]
       java.lang.Thread.State: WAITING (parking)
            at sun.misc.Unsafe.park(Native Method)
            - parking to wait for <0x000000070ca468a8> (a java.util.concurrent.locks.ReentrantLock$NonfairSync)
            at java.util.concurrent.locks.LockSupport.park(LockSupport.java:158)
            at java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:811)
            at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireQueued(AbstractQueuedSynchronizer.java:842)
            at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquire(AbstractQueuedSynchronizer.java:1178)
            at java.util.concurrent.locks.ReentrantLock$NonfairSync.lock(ReentrantLock.java:186)
            at java.util.concurrent.locks.ReentrantLock.lock(ReentrantLock.java:262)
            at org.apache.hadoop.hbase.regionserver.MemStoreFlusher.reclaimMemStoreMemory(MemStoreFlusher.java:429)
            - locked <0x000000070be8a028> (a org.apache.hadoop.hbase.regionserver.MemStoreFlusher)
            at org.apache.hadoop.hbase.regionserver.HRegionServer.multi(HRegionServer.java:2562)

    Thanks for your attention.
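
    To make the pattern in that trace concrete: the handler above holds the
    MemStoreFlusher monitor (reclaimMemStoreMemory() is synchronized) while it
    parks on an internal ReentrantLock, so every other handler calling the same
    synchronized method queues up behind the monitor. Here is a minimal,
    self-contained sketch of that shape - toy code, not HBase source; all class
    and method names are illustrative:

    import java.util.concurrent.locks.ReentrantLock;

    // Toy model of the hang in the jstack above: one thread enters the
    // synchronized method, parks on the ReentrantLock, and keeps the object
    // monitor, so every later caller blocks on the monitor itself.
    public class FlusherStall {
        private final ReentrantLock flushLock = new ReentrantLock();

        // Stands in for a synchronized reclaim method that then waits on an
        // inner lock which a slow or stuck flush may be holding.
        public synchronized void reclaim() {
            flushLock.lock();          // parks here while the "flush" holds the lock
            try {
                // the real code would trigger and wait for flushes here
            } finally {
                flushLock.unlock();
            }
        }

        public static void main(String[] args) throws InterruptedException {
            FlusherStall f = new FlusherStall();
            f.flushLock.lock();        // simulate a flush that never finishes
            for (int i = 0; i < 3; i++) {
                new Thread(f::reclaim, "IPC-handler-" + i).start();
            }
            Thread.sleep(500);
            // A jstack taken now shows one handler WAITING on the ReentrantLock
            // and the others BLOCKED on the FlusherStall monitor, matching the
            // paste above. The demo never releases the lock, so it hangs by design.
        }
    }
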
    On Mon, Feb 14, 2011 at 1:12 PM, Todd Lipcon wrote:

    Gotcha, OK. I think I understand what's going on then.
    I'm guessing you have a TON of log messages like this?

    2011-02-14 06:27:23,116 DEBUG org.apache.hadoop.hbase.regionserver.HRegion:
    NOT flushing memstore for region
    NIGHTLYDEVGRIDSGRIDSQL-THREEGPPSPEECHCALLS-1297661454931,E\xC2\xC8\xBC\xD4\xA5\x04E\xC2\xC8\xBC\xD4\xA5\x04E\xC2\xC8\xBC\xD4\xA5\x04E\xC2\xC8\xBC\xD4\xA5\x04E\xC2\xC8\xAA,1297662719973.cd06c5f9e4a0ffcc16cb6a5e559cd5b3.,
    flushing=false, writesEnabled=false

    -Todd

    On Mon, Feb 14, 2011 at 1:08 PM, Ted Yu wrote:

    The lock up happened before I shut down the region server.
    I had to get out of that situation so that I could continue 0.90.1 validation.

    On Mon, Feb 14, 2011 at 12:59 PM, Todd Lipcon wrote:

    Hi Ted,

    What was the reason for the region server shutdown? I.e. was it aborting
    itself, or did you send it a kill signal, or what?

    Still trying to understand why this happened.

    -Todd
  • Todd Lipcon at Feb 15, 2011 at 1:23 am
    Looks like it's just in the process of flushing something:

    "regionserver60020.cacheFlusher" daemon prio=10 tid=0x0000000050ce3000
    nid=0x32a9 runnable [0x0000000043138000]
    java.lang.Thread.State: RUNNABLE
    at
    org.apache.hadoop.hbase.regionserver.StoreFile$Writer.append(StoreFile.java:797)
    at
    org.apache.hadoop.hbase.regionserver.Store.internalFlushCache(Store.java:479)
    - locked <0x000000072aecd6b8> (a java.lang.Object)
    at
    org.apache.hadoop.hbase.regionserver.Store.flushCache(Store.java:448)
    at
    org.apache.hadoop.hbase.regionserver.Store.access$100(Store.java:81)
    at
    org.apache.hadoop.hbase.regionserver.Store$StoreFlusherImpl.flushCache(Store.java:1508)
    at
    org.apache.hadoop.hbase.regionserver.HRegion.internalFlushcache(HRegion.java:967)
    at
    org.apache.hadoop.hbase.regionserver.HRegion.internalFlushcache(HRegion.java:894)
    at
    org.apache.hadoop.hbase.regionserver.HRegion.flushcache(HRegion.java:846)
    at
    org.apache.hadoop.hbase.regionserver.MemStoreFlusher.flushRegion(MemStoreFlusher.java:386)
    at
    org.apache.hadoop.hbase.regionserver.MemStoreFlusher.flushOneForGlobalPressure(MemStoreFlusher.java:194)
    at
    org.apache.hadoop.hbase.regionserver.MemStoreFlusher.run(MemStoreFlusher.java:214)

    i.e. this is expected behavior. Does it not go down eventually?
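
    For context, the flushOneForGlobalPressure() frame in this trace is the
    "global memstore pressure" behavior discussed earlier in the thread: keep
    flushing until the aggregate memstore size drops back under a limit. A hedged
    sketch of that idea - not the actual MemStoreFlusher code; the names,
    thresholds, and bookkeeping are illustrative only - is:

    import java.util.Map;
    import java.util.concurrent.ConcurrentHashMap;

    // Sketch of "flush under global memstore pressure": while the sum of all
    // memstores exceeds a low-water mark, flush the region with the biggest one.
    public class GlobalPressureSketch {
        private final Map<String, Long> memstoreSizes = new ConcurrentHashMap<>();
        private final long lowWaterMark; // e.g. some fraction of the heap

        GlobalPressureSketch(long lowWaterMark) {
            this.lowWaterMark = lowWaterMark;
        }

        long globalMemstoreSize() {
            return memstoreSizes.values().stream().mapToLong(Long::longValue).sum();
        }

        // What the cache flusher thread conceptually does when handlers report
        // global pressure; each flush can take a while since it writes out a file,
        // which is why the thread shows up as RUNNABLE for long stretches.
        void relieveGlobalPressure() {
            while (globalMemstoreSize() > lowWaterMark) {
                String biggest = memstoreSizes.entrySet().stream()
                    .max(Map.Entry.comparingByValue())
                    .map(Map.Entry::getKey)
                    .orElse(null);
                if (biggest == null) {
                    return; // nothing left to flush
                }
                flushRegion(biggest);
            }
        }

        private void flushRegion(String region) {
            // Stand-in for a real flush: persist the memstore, then reset its size.
            memstoreSizes.put(region, 0L);
        }
    }
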
  • Ted Yu at Feb 15, 2011 at 2:23 am
    Looks like the region server came down after I left the office.
    Still, the second flow wasn't able to proceed writing to hbase. I hope such a
    stress test can pass.

    I am running our flow repeatedly using 0.90.1 rc1.
  • Todd Lipcon at Feb 15, 2011 at 2:28 am

    On Mon, Feb 14, 2011 at 6:23 PM, Ted Yu wrote:

    Looks like the region server came down after I left the office.
    Still, the second flow wasn't able to proceed writing to hbase. I hope such a
    stress test can pass.

    Does the same stress test pass on 0.90.0?

    I don't think failures should block 0.90.1, unless the same test passes on
    0.90.0; i.e. as long as 0.90.1 is better than our previous release, we should
    release it, and then get to work on 0.90.2 :)

    -Todd


Discussion overview: group dev @ hbase.apache.org (hbase, hadoop) - posted Feb 14, 2011 at 6:40p - active Feb 15, 2011 at 2:28a - 10 posts by 2 users (Ted Yu: 5 posts, Todd Lipcon: 5 posts)