Grokbase Groups: HBase user, May 2016
On the master web UI, I can see that region (c371fb20c372b8edbf54735409ab5c4a)
is always in the failed-close state, so the balancer cannot run.


I checked the region on the RS and found these logs about it:

2016-05-26 12:42:10,490 INFO [MemStoreFlusher.1]
regionserver.MemStoreFlusher: Waited 90447ms on a compaction to clean up
'too many store files'; waited long enough... proceeding with flush of
frog_stastic,\xFC\xAD\xD4\x07_{211}_1460209650596,1464149036644.c371fb20c372b8edbf54735409ab5c4a.
2016-05-26 12:42:20,043 INFO
  [dx-pipe-regionserver4-online,16020,1464166626969_ChoreService_1]
regionserver.HRegionServer:
dx-pipe-regionserver4-online,16020,1464166626969-MemstoreFlusherChore
requesting flush for region
frog_stastic,\xFC\xAD\xD4\x07_{211}_1460209650596,1464149036644.c371fb20c372b8edbf54735409ab5c4a.
after a delay of 20753
2016-05-26 12:42:30,043 INFO
  [dx-pipe-regionserver4-online,16020,1464166626969_ChoreService_1]
regionserver.HRegionServer:
dx-pipe-regionserver4-online,16020,1464166626969-MemstoreFlusherChore
requesting flush for region
frog_stastic,\xFC\xAD\xD4\x07_{211}_1460209650596,1464149036644.c371fb20c372b8edbf54735409ab5c4a.
after a delay of 7057


The related jstack information is below:

Thread 12403 (RS_CLOSE_REGION-dx-pipe-regionserver4-online:16020-2):
   State: WAITING
   Blocked count: 1
   Waited count: 2
   Waiting on org.apache.hadoop.hbase.regionserver.HRegion$WriteState@1390594c
   Stack:
     java.lang.Object.wait(Native Method)
     java.lang.Object.wait(Object.java:502)
     org.apache.hadoop.hbase.regionserver.HRegion.waitForFlushesAndCompactions(HRegion.java:1512)
     org.apache.hadoop.hbase.regionserver.HRegion.doClose(HRegion.java:1371)
     org.apache.hadoop.hbase.regionserver.HRegion.close(HRegion.java:1336)
     org.apache.hadoop.hbase.regionserver.handler.CloseRegionHandler.process(CloseRegionHandler.java:138)
     org.apache.hadoop.hbase.executor.EventHandler.run(EventHandler.java:128)
     java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
     java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
     java.lang.Thread.run(Thread.java:745)


Our HBase cluster version is 1.1.1. I tried to compact this region, but the
compaction is stuck at 89.58%:

frog_stastic,\xFC\xAD\xD4\x07_{211}_1460209650596,1464149036644.c371fb20c372b8edbf54735409ab5c4a.
85860221 85860221
89.58%
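
For reference, a major compaction of a single region can also be requested
and its state polled through the 1.x Java client API; a minimal sketch (the
region name is a placeholder for the full name shown above):

{code: title=CompactRegion.java}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Admin;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.util.Bytes;

public class CompactRegion {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    try (Connection conn = ConnectionFactory.createConnection(conf);
         Admin admin = conn.getAdmin()) {
      // placeholder: the full region name exactly as shown in the UI / logs
      byte[] regionName = Bytes.toBytes("<table,startkey,timestamp.encodedname.>");
      admin.majorCompactRegion(regionName);
      // poll the reported compaction state (e.g. NONE / MINOR / MAJOR)
      System.out.println(admin.getCompactionStateForRegion(regionName));
    }
  }
}
{code}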


  • Heng Chen at May 26, 2016 at 6:44 am
    And there is another question about the failed-close state: can a region
    in this state still be read and written normally?

  • Geovanie Marquez at May 26, 2016 at 11:03 am
    No, it cannot; that region is in transition. Check out hbase hbck for more
    details, and sometimes hbase hbck -repair takes care of this.
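
    For what it's worth, regions in transition can also be listed
    programmatically with the 1.x client API; a minimal sketch (assumes the
    cluster configuration is on the classpath):

    {code: title=ListRegionsInTransition.java}
    import java.util.Map;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.ClusterStatus;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.Admin;
    import org.apache.hadoop.hbase.client.Connection;
    import org.apache.hadoop.hbase.client.ConnectionFactory;
    import org.apache.hadoop.hbase.master.RegionState;

    public class ListRegionsInTransition {
      public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Admin admin = conn.getAdmin()) {
          ClusterStatus status = admin.getClusterStatus();
          // encoded region name -> current state (e.g. FAILED_CLOSE)
          for (Map.Entry<String, RegionState> e
              : status.getRegionsInTransition().entrySet()) {
            System.out.println(e.getKey() + " => " + e.getValue());
          }
        }
      }
    }
    {code}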
  • Ted Yu at May 26, 2016 at 1:50 pm
    Heng:
    Can you pastebin the complete stack trace for the region server?

    A snippet from the region server log may also provide more clues.

    Thanks
  • Heng Chen at May 27, 2016 at 2:45 am
    Thanks, guys. Yesterday I restarted the relevant RS and the failed-close
    region reopened successfully. But today another region has fallen into
    this state.

    I have pasted the relevant RS's jstack information. This time the
    failed-close region is 9368190b3ba46238534b6307702aabae.

  • Heng Chen at May 30, 2016 at 9:59 am
    I found something useful.

    When we call region.close(), if a compaction or flush is in progress,
    close() waits for the compaction or flush to finish.

    {code: title=HRegion.java}
    @Override
    public void waitForFlushesAndCompactions() {
      synchronized (writestate) {
        if (this.writestate.readOnly) {
          // we should not wait for replayed flushed if we are read only (for example in case the
          // region is a secondary replica).
          return;
        }
        boolean interrupted = false;
        try {
          while (writestate.compacting > 0 || writestate.flushing) {
            LOG.debug("waiting for " + writestate.compacting + " compactions"
                + (writestate.flushing ? " & cache flush" : "") + " to complete for region " + this);
            try {
              writestate.wait();
            } catch (InterruptedException iex) {
              // essentially ignore and propagate the interrupt back up
              LOG.warn("Interrupted while waiting");
              interrupted = true;
            }
          }
        } finally {
          if (interrupted) {
            Thread.currentThread().interrupt();
          }
        }
      }
    }
    {code}

    And writestate.flushing is set to true in two places:
    HRegion.flushCache and HRegion.replayWALFlushStartMarker.

    {code: title=HRegion.flushCache}
    synchronized (writestate) {
      if (!writestate.flushing && writestate.writesEnabled) {
        this.writestate.flushing = true;
      } else {
        // ....
      }
    }
    {code}

    {code: title=HRegion.replayWALFlushStartMarker}
    synchronized (writestate) {
      try {
        // ....
        if (!writestate.flushing) {
          this.writestate.flushing = true;
          // ...
        }
        // ....
    {code}


    Notice that in HRegion.replayWALFlushStartMarker we do not check
    writestate.writesEnabled before setting writestate.flushing to true.
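
    If replayWALFlushStartMarker mirrored the guard in flushCache above, the
    check might look like the sketch below (illustrative only, not an actual
    patch):

    {code: title=hypothetical guard in replayWALFlushStartMarker}
    synchronized (writestate) {
      // hypothetical: also require writesEnabled, as flushCache does, so a
      // closing region (writesEnabled == false) never re-enters flushing state
      if (!writestate.flushing && writestate.writesEnabled) {
        this.writestate.flushing = true;
        // ...
      }
    }
    {code}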

    So if region.close() wakes up from writestate.wait() but the monitor is
    acquired first by HRegion.replayWALFlushStartMarker, flushing will be set
    to true again, and region.close() can get stuck in writestate.wait()
    forever.


    Could this happen in practice?
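
    To make the suspected interleaving concrete, here is a minimal,
    self-contained model (hypothetical names; this is not HBase code): a
    closer thread waits for flushing to clear, while a second thread re-sets
    it under the same monitor before the closer can re-acquire it, so the
    closer never proceeds.

    {code: title=CloseStarvationDemo.java}
    public class CloseStarvationDemo {
      static final Object writestate = new Object(); // stands in for HRegion.writestate
      static boolean flushing = true;
      static boolean writesEnabled = false;          // close() has already disabled writes

      public static void main(String[] args) throws InterruptedException {
        // "closer": models waitForFlushesAndCompactions()
        Thread closer = new Thread(() -> {
          synchronized (writestate) {
            while (flushing) {
              try {
                writestate.wait();
              } catch (InterruptedException ignored) {
              }
            }
            System.out.println("close() proceeded");
          }
        });
        closer.setDaemon(true);

        // "replayer": models replayWALFlushStartMarker() winning the monitor race
        Thread replayer = new Thread(() -> {
          for (int i = 0; i < 1000; i++) {
            synchronized (writestate) {
              flushing = false;      // a flush finishes and notifies waiters...
              writestate.notifyAll();
              // ...but before the closer re-acquires the monitor, a replayed
              // flush-start marker sets flushing again WITHOUT checking
              // writesEnabled (the missing guard):
              flushing = true;
            }
          }
        });

        closer.start();
        replayer.start();
        replayer.join();
        closer.join(2000);
        System.out.println("closer still waiting: " + closer.isAlive());
      }
    }
    {code}

    Under this model the closer never observes flushing == false, which
    matches the RS_CLOSE_REGION handler parked in writestate.wait() in the
    jstack above.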


  • Ted Yu at May 30, 2016 at 1:57 pm
    There are debug logs in HRegion#replayWALFlushStartMarker:

    LOG.debug(getRegionInfo().getEncodedName() + " : "
        + " Prepared flush with seqId:" + flush.getFlushSequenceNumber());

    ...

    LOG.debug(getRegionInfo().getEncodedName() + " : "
        + " Prepared empty flush with seqId:" + flush.getFlushSequenceNumber());

    I searched for them in the log you attached to HBASE-15900 but didn't find
    any occurrence.

    FYI
  • Heng Chen at May 31, 2016 at 2:06 am
    @Ted: the log I pasted is at INFO level. I changed it to DEBUG level
    yesterday; if this happens again, I will upload the DEBUG-level log.
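
    For reference, with the stock log4j setup this is typically a one-line
    change in conf/log4j.properties on the region server, followed by a
    restart (a sketch, assuming the default logging configuration):

    {code: title=conf/log4j.properties}
    # raise regionserver logging to DEBUG so the "Prepared flush" lines appear
    log4j.logger.org.apache.hadoop.hbase.regionserver=DEBUG
    {code}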


Discussion Overview
group: user
categories: hbase, hadoop
posted: May 26, '16 at 4:48a
active: May 31, '16 at 2:06a
posts: 8
users: 3
website: hbase.apache.org
