My previous mail mentioned a bunch of design ideas that were mainly
lifted from Jeff Dean's BigTable talk. BigTable seems like a useful
way to do large-scale row storage, and their decisions largely seem
like the right ones.
However, BigTable still leaves some things on the table. Items to
improve include a query language and multi-row locking, among
Dean said explicitly in his talk that they wanted to avoid multirow
locking because it's complicated, error-prone, and maybe not necessary.
He's right on at least the first two, and maybe the third.
Multiple row locks are useful when you're making a change to
several rows that should be atomic; you want all the changes
or none of the changes. It's also used in traditional databases
if you want to perform an expensive read operation (like a
multiway join) and you want to make sure the results don't
get modified while you're reading.
Distributed lock acquisition is very hard to do. It's bug-prone
and often has very weird performance ramifications. It's
difficult to get working, difficult to tune, difficult to everything.
Here are a few ideas on what to do:
1) Suck it up and have the client acquire locks on multiple
HRegionServers simultaneously. All clients would have to
agree to acquire locks according to some global ordering to
avoid deadlock. HRegions would not be allowed to migrate
to a new server if locked.
If this is a rare circumstance, a better approach would be
to have a dedicated "lock acquirer" through which clients
make requests. It doesn't help the theoretical problem here,
but it would make debugging an awful lot easier.
2) In the case of long-lasting read operations, we can
use versioning to guarantee consistency. If each row is
annotated with an edit timestamp, and we know that there
is sufficient version history available, the long-lasting job
can run over a specific version only.
Edits can continue to be made to the database while the
read-only job is ongoing. The operation is performed over
the database as of the time the task was submitted.
3) In the case of multiple row updates, we may be able to
use different edit semantics to avoid locking. For example,
consider that we want to add a single column/value pair to
multiple rows. We want this to happen atomically, so that
both rows get the value or neither of them do so.
If it's just an add, then we don't need to lock the rows at
all; the add will always succeed, even if other writes
intervene. Traditionally there's been no difference between
among data "updates", so they all require locking. If we
can get a client to adjust the update semantics slightly,
then the locking can be much more relaxed.
I'd say that "add" or "append" semantics are likely to be
at least as common as "edit" semantics.
Can you think of the family of edit semantics you'd like
to see offered here?
Also, how useful do you think a general-purpose query language
would be for HBase? It would be fairly straightforward to implement,
for example, a poor man's version of SQL that has different locking
and update behavior (and which chucks out the more exotic elements).
This might be compiled into a piece of code that is executed
immediately, or it might be transformed into a long-lasting mapreduce
I have a few ideas for such a language, but I'm worried it's getting
a little far afield from what we're interested in for Hadoop.