I use JSON for exactly this. A simple row/column/timestamp
key leads to a compound structure encoding all of the object
attributes, or maybe arrays of objects, etc. At the scale
where HBase is an effective solution you need to
denormalize ("insert-time join") for query efficiency anyhow,
and I can serve the results out as is. Most of the work then
is done in the mapreduce tasks that produce and store the
JSON encodings in batch. I also build several views of the
data into multiple tables -- materialized views basically.
At Hadoop/HBase scale, disk space is cheap; seek time is not.
Because of this, query processing time is low enough that I
can serve results right out of HBase without needing an
intermediate caching layer such as memcached or Tokyo
Cabinet (jgray's favorite).
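A minimal sketch of the pattern described above, using a plain Python dict as a stand-in for an HBase table (the real store maps a row/column/timestamp coordinate to opaque bytes). The key names, fields, and helper functions here are illustrative assumptions, not from the original post:

```python
import json

# Hypothetical in-memory stand-in for an HBase table:
# (row key, column, timestamp) -> opaque byte value.
table = {}

def put(row_key, column, value_bytes, ts=0):
    table[(row_key, column, ts)] = value_bytes

def get(row_key, column, ts=0):
    return table[(row_key, column, ts)]

# A batch (MapReduce-style) step pre-joins related records and stores the
# denormalized result as a single JSON blob -- the "insert-time join".
user = {
    "id": 42,
    "name": "alice",
    # Orders are embedded rather than normalized into a separate table.
    "orders": [{"order_id": 1, "total": 9.99}, {"order_id": 2, "total": 4.50}],
}
put(b"user:42", b"data:json", json.dumps(user).encode("utf-8"))

# Serving path: read the blob and return it as-is, no read-time join needed.
blob = get(b"user:42", b"data:json")
print(json.loads(blob)["orders"][0]["total"])  # 9.99
```

A second "materialized view" table (say, keyed by order id) would just be another batch job writing the same pre-joined JSON under a different row key.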
Subject: Re: what is considered as best / worst practice?
Date: Sunday, December 21, 2008, 6:07 AM
Just as a temporary fix, you could also use something like
Google Protocol Buffers or Facebook's Thrift for the data
modelling and only save the binary output in HBase.
You will, however, lose the ability to filter on columns or
fetch only the columns you are interested in, and must
always fetch all of the data related to an entity.
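The trade-off can be sketched with a toy stand-in for HBase cells (a dict of (row, column) -> bytes). JSON substitutes here for real Protocol Buffers/Thrift output, and the column names are illustrative assumptions:

```python
import json

# Toy stand-in for HBase cells: (row, column) -> bytes.
cells = {}

entity = {"name": "alice", "email": "a@example.com", "bio": "x" * 1000}

# Option A: one opaque serialized blob per entity. Cheap to write, but any
# read must fetch and decode the whole value, even for a single field.
cells[(b"user:42", b"d:blob")] = json.dumps(entity).encode("utf-8")

# Option B: one HBase column per attribute. The server can then return just
# the requested columns, and server-side column filters still work.
for field, value in entity.items():
    cells[(b"user:42", b"d:" + field.encode())] = value.encode("utf-8")

# Fetching only the email: Option B reads one small cell...
email = cells[(b"user:42", b"d:email")].decode("utf-8")
# ...while Option A must pull the entire blob and deserialize it first.
email_from_blob = json.loads(cells[(b"user:42", b"d:blob")])["email"]
assert email == email_from_blob
```

With a real Protocol Buffers or Thrift blob the deserialization step is faster and the value smaller than JSON, but the whole-entity fetch requirement is the same.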