I am researching best practices for using HBase in user-facing
applications. I do not know all of the applications that will be ported to
use HBase, but they share common characteristics:
- simple key/value data; not serving large files at the moment, perhaps a
couple of columns in a single column family
- very tall tables
- hundreds of millions of rows
- need millisecond access times for a single row
- random access
- maintain very, very good query times while loading new data
The quick choice would be to use something like Memcached or Redis, but the
data is growing faster than the memory of a single box, or even a few boxes.
We also have a significant investment in Hadoop technologies, so making
HBase the primary store seems to make a lot of sense.
So, some questions:
1. Do you find that a single HBase cluster serving all applications works
better than smaller clusters serving application-specific data?
2. In the real world, do people hook APIs directly to HBase, or is there
usually a caching layer in between?
3. I remember hearing that people like StumbleUpon use separate clusters for
analytics vs. customer-facing apps. Is this still best practice?
4. Is anyone using MSLABs to reduce GC pauses in production? Experiences /
recommendations?
5. What other considerations have you found when hooking HBase up to
user-facing applications?
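For context on question 4, my understanding is that MSLAB (MemStore-Local
Allocation Buffer) is controlled via hbase-site.xml; a minimal sketch
(property names from the HBase reference guide, values shown are what I
believe are the defaults):

```xml
<!-- hbase-site.xml: enable MemStore-Local Allocation Buffers -->
<property>
  <name>hbase.hregion.memstore.mslab.enabled</name>
  <value>true</value>
</property>
<property>
  <!-- MSLAB chunk size in bytes (2 MB default, if I read the docs right) -->
  <name>hbase.hregion.memstore.mslab.chunksize</name>
  <value>2097152</value>
</property>
```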
Thanks in advance and I'd love to hear some bragging!