On OOME, regionserver sticks around and doesn't go down with cluster
--------------------------------------------------------------------

Key: HBASE-706
URL: https://issues.apache.org/jira/browse/HBASE-706
Project: Hadoop HBase
Issue Type: Bug
Reporter: stack
Fix For: 0.2.0


On John Gray's cluster, an errant, massive store file caused an OOME. Shutting down the cluster left this regionserver in place. A thread dump failed with an OOME. Here is the last thing in the log:

{code}
2008-06-25 03:21:55,111 INFO org.apache.hadoop.hbase.HRegionServer: worker thread exiting
2008-06-25 03:24:26,923 FATAL org.apache.hadoop.hbase.HRegionServer: Set stop flag in regionserver/0:0:0:0:0:0:0:0:60020.cacheFlusher
java.lang.OutOfMemoryError: Java heap space
at java.util.HashMap.<init>(HashMap.java:226)
at java.util.HashSet.<init>(HashSet.java:103)
at org.apache.hadoop.hbase.HRegionServer.getRegionsToCheck(HRegionServer.java:1789)
at org.apache.hadoop.hbase.HRegionServer$Flusher.enqueueOptionalFlushRegions(HRegionServer.java:479)
at org.apache.hadoop.hbase.HRegionServer$Flusher.run(HRegionServer.java:385)
2008-06-25 03:24:26,923 INFO org.apache.hadoop.ipc.Server: IPC Server handler 2 on 60020, call batchUpdate(items,,1214272763124, 9223372036854775807, org.apache.hadoop.hbase.io.BatchUpdate@67d6b1e2) from 192.168.249.230:38278: error: java.io.IOException: Server not running
java.io.IOException: Server not running
at org.apache.hadoop.hbase.HRegionServer.checkOpen(HRegionServer.java:1758)
at org.apache.hadoop.hbase.HRegionServer.batchUpdate(HRegionServer.java:1547)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:616)
at org.apache.hadoop.hbase.ipc.HbaseRPC$Server.call(HbaseRPC.java:413)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:901)
{code}

If I get an OOME just trying to take a thread dump, that would seem to indicate we need to start keeping a little memory reservoir around for emergencies such as this, just so we can shut down cleanly.

Moving this into 0.2. Seems important to fix if robustness is the name of the game.


  • stack (JIRA) at Jul 7, 2008 at 6:57 pm
    [ https://issues.apache.org/jira/browse/HBASE-706?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12611290#action_12611290 ]

    stack commented on HBASE-706:
    -----------------------------

    In a previous application, we'd set aside a bit of memory to release when the application hit an OOME. The reservation was done on startup. It was a linked list of sizeable blocks rather than one big monolithic block, probably so allocation would still work in a fragmented heap. In that app's case, the default was a single block of 5M. Maybe in hbase, set aside more? 20M? In 4 blocks? Make it configurable? What do you think? Default hbase heap is 1G, I believe (see the bin/hbase script).

    The main loop in the application was wrapped in a try/catch. On 'serious error', we'd first release the memory reservoir -- the release could be run by more than one thread, so sometimes it'd be a no-op -- and then run code to put the application into a safe 'park' so it could be analyzed later by an operator. In our case, things would be a little trickier because there is more than just the one loop. The OOME could bubble out in the main master/regionserver loops or in one of the service thread loops. You'd have to plug the OOME processing into all those places (some inherit from Chore, so you could add processing there). We also want our regionserver to go all the way down if it hits an OOME, to minimize the damage done.

    I'd imagine that all you'd do on OOME is release the memory and then let the shutdown proceed normally. Hopefully, the release of the reservoir alone should be sufficient for a successful shutdown.
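
    A minimal sketch of the reservation scheme described above, assuming hypothetical class and method names (this is an illustration, not the committed patch):

    {code}
    import java.util.LinkedList;
    import java.util.List;

    // Emergency memory reservation: allocate at startup, release on OOME.
    // All names here are illustrative.
    public class MemoryReservation {
      private final List<byte[]> blocks = new LinkedList<byte[]>();

      /** Set aside 'count' blocks of 'blockSize' bytes at startup. */
      public synchronized void reserve(int count, int blockSize) {
        for (int i = 0; i < count; i++) {
          blocks.add(new byte[blockSize]);
        }
      }

      /** Drop the reservation so the GC can reclaim it; extra calls are no-ops. */
      public synchronized void release() {
        blocks.clear();
      }
    }
    {code}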

    Tests will be hard. There is an OOMERegionServer; you might play with that. You probably won't be able to run it inline as a unit test. That's OK, I think. Also, I don't think it's possible to write a handler that will work in all cases, just most: e.g. there may be a pathological case where the just-released reservoir gets eaten up immediately by a rampant thread. I think we'll just have to make a patch that does the above, commit it, and then watch how well it does out in the field.


  • Jean-Daniel Cryans (JIRA) at Jul 7, 2008 at 7:04 pm
    [ https://issues.apache.org/jira/browse/HBASE-706?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

    Jean-Daniel Cryans reassigned HBASE-706:
    ----------------------------------------

    Assignee: Jean-Daniel Cryans
  • Jean-Daniel Cryans (JIRA) at Jul 8, 2008 at 12:27 pm
    [ https://issues.apache.org/jira/browse/HBASE-706?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12611548#action_12611548 ]

    Jean-Daniel Cryans commented on HBASE-706:
    ------------------------------------------

    I added a reservation linked list made of 5M blocks; the number of blocks is defined by configuration and defaults to 4. Using an HRegionServer that works like OOMERegionServer (easier, I thought), the attached loader.jsp, and the reservation list, here is a trace of an OOME that is correctly handled:

    {noformat}
    2008-07-07 17:51:42,853 DEBUG org.apache.hadoop.hbase.regionserver.HStore: Completed compaction of 1705462391/contents store size is 23.7m
    2008-07-07 17:51:42,868 INFO org.apache.hadoop.hbase.regionserver.HRegion: compaction completed on region loading,53841,1215463619617 in 6sec
    2008-07-07 17:52:23,425 WARN org.apache.hadoop.hbase.regionserver.HRegionServer: Ran out of memory
    java.lang.OutOfMemoryError: Java heap space
    at java.util.Arrays.copyOf(Arrays.java:2760)
    at java.util.Arrays.copyOf(Arrays.java:2734)
    at java.util.ArrayList.ensureCapacity(ArrayList.java:167)
    at java.util.ArrayList.add(ArrayList.java:351)
    at org.apache.hadoop.hbase.regionserver.HRegionServer.batchUpdate(HRegionServer.java:1152)
    ...
    2008-07-07 17:52:23,587 DEBUG org.apache.hadoop.hbase.RegionHistorian: Offlined
    2008-07-07 17:52:23,588 INFO org.apache.hadoop.ipc.Server: Stopping server on 60020
    2008-07-07 17:52:23,588 INFO org.apache.hadoop.ipc.Server: IPC Server handler 0 on 60020: exiting
    2008-07-07 17:52:23,588 INFO org.apache.hadoop.ipc.Server: IPC Server handler 1 on 60020: exiting
    2008-07-07 17:52:23,588 INFO org.apache.hadoop.ipc.Server: IPC Server handler 2 on 60020: exiting
    2008-07-07 17:52:23,588 INFO org.apache.hadoop.ipc.Server: IPC Server handler 5 on 60020: exiting
    ...
    2008-07-07 17:52:33,431 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: worker thread exiting
    2008-07-07 17:52:33,432 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: regionserver/0:0:0:0:0:0:0:0:60020 exiting
    2008-07-07 17:52:33,432 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: Starting shutdown thread.
    2008-07-07 17:52:33,432 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: Shutdown thread complete
    {noformat}
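
    The block count and size described above might be read from configuration roughly like this; the property name here is an assumption, not necessarily what the patch uses:

    {code}
    import java.util.LinkedList;
    import org.apache.hadoop.conf.Configuration;

    public class ReservationSetup {
      static final int BLOCK_SIZE = 5 * 1024 * 1024;  // 5M blocks, as described above

      static LinkedList<byte[]> buildReservation(Configuration conf) {
        LinkedList<byte[]> reservedSpace = new LinkedList<byte[]>();
        // "hbase.regionserver.nbreservationblocks" is an assumed property name; defaults to 4 blocks
        int nbBlocks = conf.getInt("hbase.regionserver.nbreservationblocks", 4);
        for (int i = 0; i < nbBlocks; i++) {
          reservedSpace.add(new byte[BLOCK_SIZE]);
        }
        return reservedSpace;
      }
    }
    {code}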

  • Jean-Daniel Cryans (JIRA) at Jul 8, 2008 at 12:28 pm
    [ https://issues.apache.org/jira/browse/HBASE-706?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

    Jean-Daniel Cryans updated HBASE-706:
    -------------------------------------

    Attachment: loader.jsp
  • Jean-Daniel Cryans (JIRA) at Jul 8, 2008 at 12:45 pm
    [ https://issues.apache.org/jira/browse/HBASE-706?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

    Jean-Daniel Cryans updated HBASE-706:
    -------------------------------------

    Attachment: hbase-706-v1.patch

    Please review. I added two OOME catches; maybe more are needed.
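
    For illustration, one such catch might look roughly like this (names are hypothetical, not the exact patch; the idea is to release the reservation, flag shutdown, and rethrow as an IOException):

    {code}
    import java.io.IOException;
    import java.util.LinkedList;
    import java.util.List;
    import java.util.concurrent.atomic.AtomicBoolean;

    public class OomeCatchSketch {
      private final List<byte[]> reservedSpace = new LinkedList<byte[]>();
      private final AtomicBoolean stopRequested = new AtomicBoolean(false);

      public void batchUpdate(Runnable work) throws IOException {
        try {
          work.run();                      // the allocation-heavy update path
        } catch (OutOfMemoryError error) {
          reservedSpace.clear();           // free the emergency blocks first
          stopRequested.set(true);         // ask the regionserver to go down
          throw new IOException("Ran out of memory: " + error.getMessage());
        }
      }
    }
    {code}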
  • Jean-Daniel Cryans (JIRA) at Jul 8, 2008 at 12:53 pm
    [ https://issues.apache.org/jira/browse/HBASE-706?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

    Jean-Daniel Cryans updated HBASE-706:
    -------------------------------------

    Attachment: hbase-706-v1.patch
  • Jean-Daniel Cryans (JIRA) at Jul 8, 2008 at 12:53 pm
    [ https://issues.apache.org/jira/browse/HBASE-706?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

    Jean-Daniel Cryans updated HBASE-706:
    -------------------------------------

    Attachment: (was: hbase-706-v1.patch)
  • stack (JIRA) at Jul 8, 2008 at 9:12 pm
    [ https://issues.apache.org/jira/browse/HBASE-706?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

    stack resolved HBASE-706.
    -------------------------

    Resolution: Fixed

    Committed. Thanks for the patch J-D.
