Grokbase Groups HBase dev June 2009
rare race condition can take down a regionserver.
--------------------------------------------------

Key: HBASE-1569
URL: https://issues.apache.org/jira/browse/HBASE-1569
Project: Hadoop HBase
Issue Type: Bug
Affects Versions: 0.20.0
Reporter: ryan rawson
Priority: Critical
Fix For: 0.20.0


this happened after > 24 hours of heavy import load on my cluster. Luckily the shutdown seemed to be clean:

java.lang.IllegalAccessError: Call open first
at org.apache.hadoop.hbase.regionserver.StoreFile.getReader(StoreFile.java:356)
at org.apache.hadoop.hbase.regionserver.Store.getStorefilesIndexSize(Store.java:1378)
at org.apache.hadoop.hbase.regionserver.HRegionServer.doMetrics(HRegionServer.java:1075)
at org.apache.hadoop.hbase.regionserver.HRegionServer.run(HRegionServer.java:454)
at java.lang.Thread.run(Thread.java:619)


--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


  • ryan rawson (JIRA) at Jun 23, 2009 at 8:20 am
    [ https://issues.apache.org/jira/browse/HBASE-1569?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12723002#action_12723002 ]

    ryan rawson commented on HBASE-1569:
    ------------------------------------

    this is a race condition, here is how it happens:

    doMetrics() calls getStorefilesIndexSize() which gets a view of the storefiles ConcurrentSkipListMap at some point in time. Working on this snapshot it calls each store file in turn asking for the index size.

    In another thread, the compaction completion code finishes. The first things it does are:
    - remove the compacted store files from the storefiles map.
    - do some stuff
    - close the aforementioned store files, which causes this.reader to become null.

    Back in thread #1, we run into this.reader == null, and we throw the exception.

    So we need to do either of:
    - sync on this map, or use a synced version of the map
    - allow the metrics to be checked without causing a RS abort when we hit an exception. Either catch it, or prevent it from happening.
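
The interleaving described above can be sketched in a few lines of Java. This is a minimal illustration, not the HBase code: `FakeStoreFile` and `RaceSketch` are hypothetical stand-ins, the two threads are interleaved by hand so the outcome is deterministic, and the snapshot is modelled as a plain copy of the map's values (an assumption about what "a view at some point in time" amounts to).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ConcurrentSkipListMap;

// Hypothetical stand-in for StoreFile: the reader becomes null once closed.
class FakeStoreFile {
    private volatile Object reader = new Object();

    // Mirrors the pre-patch behaviour seen in the stack trace.
    Object getReader() {
        Object r = this.reader;
        if (r == null) {
            throw new IllegalAccessError("Call open first");
        }
        return r;
    }

    void close() {
        this.reader = null;
    }
}

class RaceSketch {
    // Returns true if the metrics pass runs into a closed store file.
    static boolean metricsHitsClosedFile() {
        ConcurrentSkipListMap<Long, FakeStoreFile> storefiles =
            new ConcurrentSkipListMap<>();
        storefiles.put(1L, new FakeStoreFile());
        storefiles.put(2L, new FakeStoreFile());

        // Thread #1 (metrics): take a point-in-time snapshot of the files.
        List<FakeStoreFile> snapshot = new ArrayList<>(storefiles.values());

        // Thread #2 (compaction completion), interleaved by hand:
        // remove the file from the map, then close it.
        FakeStoreFile compacted = storefiles.remove(1L);
        compacted.close();

        // Back in thread #1: the snapshot still references the closed file.
        try {
            for (FakeStoreFile sf : snapshot) {
                sf.getReader();
            }
            return false;
        } catch (IllegalAccessError e) {
            return true;
        }
    }

    public static void main(String[] args) {
        System.out.println("metrics hit closed file: " + metricsHitsClosedFile());
    }
}
```

The map itself is never corrupted here. The problem is that the snapshot outlives the removal, which is why locking the map alone would not help unless the close also happened under the same lock.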
  • stack (JIRA) at Jun 23, 2009 at 11:33 pm
    [ https://issues.apache.org/jira/browse/HBASE-1569?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12723361#action_12723361 ]

    stack commented on HBASE-1569:
    ------------------------------

    At first I thought that the use of ConcurrentSkipListMap was the problem, but thinking on it more, we instead need to make the code tolerate the fact that a file has been moved or removed. The alternative is syncing around file operations until they complete, which is too much to ask.

    A good while ago, an issue in metrics got the HRS stuck in an infinite loop.

    Let me try to hack up a patch.
  • stack (JIRA) at Jun 24, 2009 at 12:48 am
    [ https://issues.apache.org/jira/browse/HBASE-1569?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

    stack updated HBASE-1569:
    -------------------------

    Attachment: sf.patch

    Unfinished first attempt.
  • stack (JIRA) at Jun 24, 2009 at 5:02 am
    [ https://issues.apache.org/jira/browse/HBASE-1569?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

    stack updated HBASE-1569:
    -------------------------

    Attachment: 1569-v2.patch

    Patch does following:

    + Wraps metrics in a try/catch that catches any exception, logs it and then moves on rather than letting it out to kill the HRS
    + Changed getReader so it doesn't throw IllegalAccessError with "Call open first" but instead lets out the null Reader.
    + Then, changed wherever we get a Reader to check for null. If null, log it so we can see the extent of the phenomenon, but in general just keep going.
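
A rough sketch of the three changes listed above, under stated assumptions: `SketchReader`, `PatchedStoreFile` and `MetricsSketch` are hypothetical names, logging goes to stderr, and `-1` is an invented sentinel for "metrics failed". The patch itself is not shown in this thread, so this only illustrates the approach. The sketch catches `Throwable` because `IllegalAccessError` is an `Error`, which a `catch (Exception e)` would not intercept; what the actual patch catches is not stated here.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical reader that exposes only an index size.
class SketchReader {
    private final long indexSize;

    SketchReader(long indexSize) {
        this.indexSize = indexSize;
    }

    long indexSize() {
        return indexSize;
    }
}

// Hypothetical stand-in for StoreFile after the patch.
class PatchedStoreFile {
    private volatile SketchReader reader;

    PatchedStoreFile(long indexSize) {
        this.reader = new SketchReader(indexSize);
    }

    // Lets the null out instead of throwing.
    SketchReader getReader() {
        return this.reader;
    }

    void close() {
        this.reader = null;
    }
}

class MetricsSketch {
    // Sums index sizes, logging and skipping files that were closed under us.
    static long storefilesIndexSize(List<PatchedStoreFile> files) {
        long size = 0;
        for (PatchedStoreFile sf : files) {
            SketchReader r = sf.getReader();
            if (r == null) {
                System.err.println("StoreFile has a null Reader; skipping");
                continue;
            }
            size += r.indexSize();
        }
        return size;
    }

    // Wraps the metrics pass so a failure is logged rather than propagated.
    // Returns -1 (a sentinel chosen for this sketch) when the pass fails.
    static long doMetricsSafely(List<PatchedStoreFile> files) {
        try {
            return storefilesIndexSize(files);
        } catch (Throwable t) {
            System.err.println("Failed metrics: " + t);
            return -1;
        }
    }

    public static void main(String[] args) {
        List<PatchedStoreFile> files =
            Arrays.asList(new PatchedStoreFile(10), new PatchedStoreFile(32));
        files.get(0).close(); // simulate compaction closing a file mid-pass
        System.out.println("index size: " + doMetricsSafely(files));
    }
}
```

The null check keeps the metric usable (it undercounts briefly instead of failing), while the outer try/catch is the backstop for anything the null checks miss.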
  • stack (JIRA) at Jun 24, 2009 at 5:02 am
    [ https://issues.apache.org/jira/browse/HBASE-1569?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

    stack reassigned HBASE-1569:
    ----------------------------

    Assignee: stack
  • stack (JIRA) at Jun 24, 2009 at 5:02 am
    [ https://issues.apache.org/jira/browse/HBASE-1569?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

    stack updated HBASE-1569:
    -------------------------

    Status: Patch Available (was: Open)
  • Lars George (JIRA) at Jun 25, 2009 at 9:51 am
    [ https://issues.apache.org/jira/browse/HBASE-1569?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12723981#action_12723981 ]

    Lars George commented on HBASE-1569:
    ------------------------------------

    I ran into the same issue; it killed the HRS that hosted the ROOT partition. The "Catalog Tables" table in the UI was empty afterwards and the running MR job was failing fast. I synced to trunk, applied this patch and restarted the cluster. Two pieces of the above patch fail to apply, but they touch the removed bloomfilter variables, which look to have been removed in trunk already, so no harm done.

    Will report if I run into the new log outputs or if anything else happens.
  • ryan rawson (JIRA) at Jun 25, 2009 at 10:56 pm
    [ https://issues.apache.org/jira/browse/HBASE-1569?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12724333#action_12724333 ]

    ryan rawson commented on HBASE-1569:
    ------------------------------------

    +1 lgtm
  • stack (JIRA) at Jun 25, 2009 at 11:58 pm
    [ https://issues.apache.org/jira/browse/HBASE-1569?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

    stack updated HBASE-1569:
    -------------------------

    Resolution: Fixed
    Status: Resolved (was: Patch Available)

    Committed
