On Sun, Oct 4, 2009 at 7:24 PM, Zheng Shao wrote:
+1
Making input data and query results available in a short delay is definitely
a very attractive feature for Hive.
There are multiple approaches to achieve this, mainly depending on how much
we leverage HBase.
The simplest way to go is to probably have a good Hive/HBase integration
like HIVE-705, HIVE-806 etc.
This can help us leverage the efforts done by HBase to the maximum degree.
The potential drawback is that HBase tables have support for random writes
which may cause additional overhead for simple sequential writes.
Eventually we may (or may not) need our own HiveRegionServer which hosts
data in any format supported by Hive (on top of just the internal file
format supported by HBase), but I feel it might be a good start to first try
integrate the two.
Zheng
On Sun, Oct 4, 2009 at 8:42 AM, Edward Capriolo wrote:
After sitting through some HDFS/HBase presentations yesterday, I
started thinking that the Hive model of doing its map/reduce over raw
files from HDFS is great, but a dedicated caching/region server could
be a big benefit in answering real-time queries.
I calculated that one data center (not counting non-cacheable content)
could have about 378MB of logs a day. Going from Facebook's information
here:
http://www.facebook.com/note.php?note_id=110207012002
"The log files are named with the date and time of collection.
Individual hourly files are around 55 MB when compressed, so eight
months of compressed data takes up about 300 GB of space."
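As a rough sanity check on those quoted numbers (my own arithmetic, not from the Facebook post):

```python
# Back-of-the-envelope check of the quoted Facebook log volumes.
MB_PER_HOUR = 55          # "Individual hourly files are around 55 MB"
HOURS_PER_DAY = 24
DAYS = 8 * 30             # roughly eight months

daily_mb = MB_PER_HOUR * HOURS_PER_DAY    # compressed MB per day
total_gb = daily_mb * DAYS / 1024         # compressed GB over eight months

print(daily_mb, round(total_gb))          # 1320 309
```

About 1.3 GB of compressed logs a day, and roughly 310 GB over eight months, which lines up with the "about 300 GB" figure quoted above.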
During the day and week the logs are collected, one would expect the
data to be used very often, so having it in a cache would be ideal.
Given that an average DataNode might have 8 GB or 16 GB of RAM, one GB
could be sliced off as a dedicated HiveRegion server, or it could run
as several dedicated servers, with maybe RAM and nothing else.
A HiveRegionServer would/could contain Hive tables in a compressed
format, maybe Hive tables in a Derby format, indexes we are creating,
and some information about usage so different caching algorithms
could evict sections. We could use ZooKeeper to manage the HiveRegions
like HBase does.
The Hive query optimizer would check whether the data was already in a
HiveRegionServer, and otherwise run the query as normal.
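A minimal sketch of what that planner check could look like. All of the names here (RegionCache, plan_query) are made up for illustration; nothing like this exists in Hive today:

```python
# Hypothetical sketch: a planner consults a HiveRegionServer-style
# cache before falling back to a normal map/reduce scan.

class RegionCache:
    """Toy stand-in for a HiveRegionServer: table partitions held in RAM."""

    def __init__(self):
        self._store = {}

    def put(self, table, partition, rows):
        self._store[(table, partition)] = rows

    def get(self, table, partition):
        # Returns None on a cache miss.
        return self._store.get((table, partition))


def plan_query(cache, table, partition):
    """Return ('cache', rows) when the region is resident, else ('mapreduce', None)."""
    rows = cache.get(table, partition)
    if rows is not None:
        return ("cache", rows)
    return ("mapreduce", None)


cache = RegionCache()
cache.put("people", "2009-10-04", [("sarah",), ("zheng",)])
print(plan_query(cache, "people", "2009-10-04")[0])  # cache
print(plan_query(cache, "people", "2009-10-05")[0])  # mapreduce
```

The real decision would of course involve partition pruning and freshness checks, but the shape is the same: serve from the resident region if possible, otherwise launch the usual job.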
Has anyone ever thought of this?
Edward
--
Yours,
Zheng
I agree that Hive/HBase integration is a good thing. I think that the
differences between Hive and HBase are vast. Hive is row-oriented with
column support and HBase is column-oriented. HBase is working on
sparse files and needs random inserts, while Hive data is mostly Write
Once Read Many. HBase works mostly in memory with a commit log,
while Hive writes during the map and reduce phases directly to HDFS.
The way I look at it, there is already a lot of waste. Imagine jobs
that are run simultaneously or right after each other on relatively
small data sets:
select name, count(*) from people group by name;
select * from people;
select * from people where name='sarah';
With a HiveRegionServer, sections of data might already be in memory in
a fast binary form, or on disk in an embedded db like the one used by
map-side joins. Disks would be used for intermediate results rather
than reprocessing the same chunks of data repeatedly.
Managing HiveRegionServers would be much less complex than managing
HBase regions that take high random read/insert rates. In its simplest
form it would just be an exact duplicate of the data; in a more complex
form, an optimized binary representation of the data.