We (Inadco) are looking for users and developers to engage with our open
project, code-named (for lack of a better name) "hbase-lattice", in
order to mutually benefit and eventually develop a mature HBase-based
real-time BI OLAP-ish solution.
The basic premise is to use a cuboid-lattice-like model (hence the
name) to precompute manually specified cuboids (as opposed to
statistically inferred ones) and then build query optimization on top
of those, similarly to how a DBA analyzes query use cases and decides
which indexes to build where. On top of that, we threw in both a
declarative API and query language support (our reporting system and
real-time analytics platform actually use the query language, to make
things easier for our developers there).
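To illustrate the idea, here is a rough sketch of cuboid selection. All names and structures below are hypothetical, purely for illustration; the actual optimizer lives in the project code:

```python
# Illustrative sketch only -- names and structures here are hypothetical,
# not the actual hbase-lattice API.

# Each cuboid is identified by the set of dimensions it is grouped by;
# the lattice is just the collection of manually specified cuboids.
CUBOIDS = [
    frozenset({"hour"}),
    frozenset({"hour", "campaign"}),
    frozenset({"hour", "campaign", "site"}),
]

def pick_cuboid(query_dims):
    """Pick the smallest (most aggregated) cuboid that covers the query,
    much like a DBA picking the narrowest index that covers it."""
    candidates = [c for c in CUBOIDS if set(query_dims) <= c]
    if not candidates:
        raise ValueError("no precomputed cuboid covers %r" % (query_dims,))
    return min(candidates, key=len)

# A query grouped by hour+campaign scans the 2-dimension projection,
# not the wider 3-dimension one.
print(sorted(pick_cuboid({"hour", "campaign"})))  # ['campaign', 'hour']
```

The point is the same trade-off a DBA makes: you pay storage and compilation time for each extra cuboid, and in exchange the query touches the narrowest precomputed projection available.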
We aim to answer aggregate queries over a cube of facts with very short
HBase scans over the corresponding projection table, so an analytical
reply can be produced in under 1 ms plus network overhead.
Another goal is low latency for fact availability, so we employ
incremental MR fact compilation (~3-5 minutes from the fact event on
average, depending on the number of cuboids and the cluster size).
At this point we have the project in production with the minimum
capabilities we need, still in a kind of pilot mode, but it is to
replace our previous manual incremental projection work on a permanent
basis very soon (as soon as we finish migrating our reporting).
There's a long todo list, and any given third party would likely find
gaps it would like to see filled.
Right now we have integration only with the JasperReports tool, but
eventually it shouldn't be too difficult to wrap the client into a JDBC
contract as well and enable practically any use (it's just that since
we integrate tightly with both the reporting and RT platforms, we don't
have a need for JDBC per se at the moment).
The project readme document is here:
https://github.com/inadco/HBase-Lattice/blob/dev-0.1.x/docs/readme.pdf?raw=true
Aside from the capabilities mentioned in that document, we also support
custom aggregate functions, which can be plugged directly into the
model definition (so we can develop a custom aggregate function set
rather quickly and easily, as it stands).
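As a rough sketch of what a pluggable aggregate function boils down to, consider the usual init/update/merge/finalize contract that incremental aggregates generally follow. This is a hypothetical illustration, not the actual hbase-lattice interface (which is Java):

```python
# Hypothetical sketch of the init/update/merge/finalize contract that
# incremental aggregate functions generally follow; not the actual
# hbase-lattice interface.

class SumAndCount:
    """A custom aggregate producing both sum and count, so an average
    can be derived at query time."""

    def init(self):
        return (0.0, 0)  # (sum, count)

    def update(self, state, value):
        s, n = state
        return (s + value, n + 1)

    def merge(self, a, b):
        # Merging partial states is what makes incremental compilation
        # possible: cuboid cells combine without re-scanning raw facts.
        return (a[0] + b[0], a[1] + b[1])

    def finalize(self, state):
        s, n = state
        return s / n if n else None

agg = SumAndCount()
part1 = agg.update(agg.update(agg.init(), 2.0), 4.0)  # one MR split
part2 = agg.update(agg.init(), 6.0)                   # another split
print(agg.finalize(agg.merge(part1, part2)))  # 4.0 (average of 2, 4, 6)
```

The key design requirement is the merge step: as long as partial states combine associatively, each incremental compilation pass only has to fold new facts into existing cuboid cells.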
The cube update is incremental, with a two-step (sequentially
generated) Pig job (so the compiler cycle could be ~5 minutes from the
actual event). We process and aggregate our impression, click, and
other fact streams with it. The model can be updated under backward
compatibility conventions similar to protobuf's (as in: add stuff, but
don't change types), and the changes can be pushed into the production
system to take effect immediately thereafter without any need to
redeploy code.
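The "add stuff, but don't change types" convention can be illustrated like this (with made-up field names): records written under the old model stay readable under the grown model because newly added fields simply take their defaults when absent, much as protobuf handles missing fields:

```python
# Illustration (made-up field names) of the add-only evolution convention.

OLD_MODEL = {"impressions": int, "clicks": int}
NEW_MODEL = {"impressions": int, "clicks": int, "conversions": int}  # field added

def read(record, model):
    # Missing fields fall back to the type's default (0 for int), as in
    # protobuf, so old data remains readable after the model grows.
    return {name: typ(record.get(name, typ())) for name, typ in model.items()}

old_record = {"impressions": 100, "clicks": 7}  # written before the change
print(read(old_record, NEW_MODEL))
# {'impressions': 100, 'clicks': 7, 'conversions': 0}
```

Changing or removing an existing field's type breaks this property, which is why the convention is additive only.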
Operationally, I tested it on rather wide event fact streams of 1.2
billion events/day, packed as protobuf messages inside sequence files,
on 6 nodes, and the compilation was not even breaking a sweat.
Obviously, my data aggregates heavily over the time dimensions that
correspond to the event time, so the HBase update load is actually
pretty light due to the high degree of aggregation. But the biggest
benefit is that one can scale the number of facts handled per unit of
time horizontally.
This is optimized for time-series data for the most part, so
consequently one will see very limited, time-oriented support for
dimension types and hierarchies at the moment.
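The time-oriented aggregation that keeps the HBase write load light can be sketched as a rollup along a time hierarchy. The data and bucketing below are hypothetical, standard-library only, just to show why many raw events collapse into few cuboid cells:

```python
# Sketch of why time-dimension aggregation keeps write load light:
# many raw events collapse into a few cuboid cells per time bucket.
from collections import Counter

def rollup(event_timestamps, bucket_seconds):
    """Count events per time bucket (e.g. per hour for bucket_seconds=3600)."""
    return Counter(ts - ts % bucket_seconds for ts in event_timestamps)

# Four raw events, all inside the same hour, collapse to one cell.
events = [3600, 3661, 3700, 7199]
print(dict(rollup(events, 3600)))  # {3600: 4}
```

At 1.2 billion events/day, the number of distinct (time bucket, dimension) cells written to HBase is orders of magnitude smaller than the raw event count, which is what makes the incremental update load manageable.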
Generally, I think the need is very common for BI solutions over
big-data time series such as impression or request logs, but
surprisingly I did not find a well-maintained HBase solution for that
(although I did see either stale or less capable attempts out there --
and I have certainly missed stuff floating around) -- hence this
project.
We are planning to maintain the project for a long time as a part of
our production system. Please email me if there's interest in being
either a user or a collaborator. I think I saw a couple of emails on
this list looking for a solution to a similar problem.
This is partly inspired by, and intended as a complementary solution
to, Tsuna's OpenTSDB (so big thanks to the StumbleUpon folks for an
example of how to handle time-series data).