It appears to take 30 minutes or so for HBase to recover from the failure
of the regionserver holding the ROOT role. Please let me know what options
are available to recover more quickly from such a situation, since our
applications and SLAs are impacted when this happens.
It would also be good to be able to quickly recover from a failure of the
regionserver which owns the .META. table. During HBase startup, a random
server is elected to manage each of the ROOT and .META. tables (a different
server for each).
This creates a single point of failure. At the very least, perhaps we can
find a way to force which server is selected for this role, perhaps even
just via startup order. We could then assign a server which doesn't
participate in flow tasks (no tasktracker), and so would be more stable.
There may also be a config option for this. I am also wondering whether there
is a way to force election of a new ROOT/META owner within a minute or so
instead of