Grokbase Groups HBase user May 2016
We had a regionserver fall out of our cluster, I assume due to the process
hitting a limit, as the regionserver's .out log file just contained "Killed",
which I've experienced before when hitting open file descriptor limits. After
this, hbck reported inconsistencies in tables:

ERROR: There is a hole in the region chain between
dce998f6f8c63d3515a3207330697ce4-ravi teja and e4. You need to create a
new .regioninfo and region dir in hdfs to plug the hole.

`hdfs fsck` reports a healthy dfs.

I attempted to run `hbase hbck -repairHoles` which didn't resolve the
inconsistencies.
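
(For reference, a sketch of a fuller hbck pass, assuming an HBase 1.x hbck and
that these individual fix flags are appropriate for the situation:

    hbase hbck                                          # report-only check
    hbase hbck -fixAssignments -fixMeta -fixHdfsHoles   # attempt to plug the hole

-repairHoles is roughly shorthand for these fixes plus -fixHdfsOrphans, so
running the flags individually mainly helps narrow down which step is failing.)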

I then restarted the HBase cluster and it now appears from looking at the
master log files that there are many tasks waiting to complete, and the web
interface results in a timeout:

master.SplitLogManager: total tasks = 299 unassigned = 285 tasks={ ... }

From looking at the logs on the regionservers I see messages such as:
"regionserver.SplitLogWorker: Current region server ... has 2 tasks in
progress and can't take more".

How can I speed up working through these tasks? I suspect our nodes can
handle many more than 2 tasks at a time. I'll likely have followup
questions once these have been worked through, but I think that's it for now.

Any other information you need?

  • Michael Stack at May 29, 2016 at 11:59 pm

    On Fri, May 27, 2016 at 9:37 AM, Harry Waye wrote:

    We had a regionserver fall out of our cluster, I assume due to the process
    hitting a limit, as the regionserver's .out log file just contained "Killed",
    which I've experienced before when hitting open file descriptor limits. After
    this, hbck reported inconsistencies in tables:
    Or the kernel is killing the process because the host is out of memory (no
    swap, and all memory occupied by running processes).
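
    (A quick way to confirm an OOM kill, assuming you have access to the kernel
    log on the affected host, is:

        dmesg | grep -iE "killed process|out of memory"

    which lists any processes the kernel's OOM killer has terminated.)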

    ERROR: There is a hole in the region chain between
    dce998f6f8c63d3515a3207330697ce4-ravi teja and e4. You need to create a
    new .regioninfo and region dir in hdfs to plug the hole.

    `hdfs fsck` reports a healthy dfs.

    I attempted to run `hbase hbck -repairHoles` which didn't resolve the
    inconsistencies.

    I then restarted the HBase cluster and it now appears from looking at the
    master log files that there are many tasks waiting to complete, and the web
    interface results in a timeout:

    master.SplitLogManager: total tasks = 299 unassigned = 285 tasks={ ... }
    It seems we are trying to split WAL files before the cluster comes back
    online. Are we stuck on one WAL?
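
    (One way to see which WALs are still pending splits, assuming a recent HBase
    layout and the default root directory of /hbase in HDFS, is to list the
    per-server splitting directories:

        hdfs dfs -ls /hbase/WALs | grep -- "-splitting"

    Directories with the -splitting suffix still have logs waiting to be
    replayed.)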


    From looking at the logs on the regionservers I see messages such as:
    "regionserver.SplitLogWorker: Current region server ... has 2 tasks in
    progress and can't take more".
    There is a configuration which controls how many split tasks each
    regionserver will take on: "hbase.regionserver.wal.max.splitters"



    How can I speed up working through these tasks? I suspect our nodes can
    handle many more than 2 tasks at a time. I'll likely have followup
    questions once these have been worked through, but I think that's it for
    now.
    Did your cluster recover? Or is there a bad WAL in the way, one damaged
    somehow by the kill (perhaps regionservers other than this one are getting
    killed on your possibly oversubscribed cluster)?

    Yours,
    St.

    Any other information you need?
