Do you also need to be able to append the new data to an existing partition?
From: Schubert Zhang
Sent: Wednesday, August 05, 2009 11:43 AM
Subject: Re: Questions for the future work of Hive
Regarding automatic multi-partition insertion, is it the planned feature "Inserts without listing partitions"?
In our applications, we really want this feature, since our data arrives in the data warehouse continually and we cannot know which partition a row belongs to before reading it.
Regarding Hive backed by HBase, I think Hive could also store its persistent data in HBase, with the following advantages:
1. The placement of each row is handled by HBase.
2. The stored rows are sorted and indexed by HBase, and the index is a global table index.
3. The data in HBase can be queried through a SQL interface via Hive.
On Wed, Aug 5, 2009 at 3:26 PM, Zheng Shao wrote:
1) We have not started working on a cost-based optimizer yet. Indexing is
one of the ongoing efforts on the performance side. We are working on a
couple more, e.g. a more compact on-disk format (LazyBinarySerDe,
https://issues.apache.org/jira/browse/HIVE-640), which gives a nice
speed-up for queries with multiple map-reduce jobs.
2) We don't have a short-term plan for automatic-multi-partition
insertion. However there is a simple workaround if you know the
partition values (and Hive can do multiple inserts in a single
map-reduce job!). "src" can be a sub query as well.
FROM src
INSERT OVERWRITE TABLE tgt PARTITION(pcol="2009-08-01") SELECT * WHERE ts = "2009-08-01"
INSERT OVERWRITE TABLE tgt PARTITION(pcol="2009-08-02") SELECT * WHERE ts = "2009-08-02"
INSERT OVERWRITE TABLE tgt PARTITION(pcol="2009-08-03") SELECT * WHERE ts = "2009-08-03"
INSERT OVERWRITE TABLE tgt PARTITION(pcol="2009-08-04") SELECT * WHERE ts = "2009-08-04"
INSERT OVERWRITE TABLE tgt PARTITION(pcol="2009-08-05") SELECT * WHERE ts = "2009-08-05"
INSERT OVERWRITE TABLE tgt PARTITION(pcol="2009-08-06") SELECT * WHERE ts = "2009-08-06"
INSERT OVERWRITE TABLE tgt PARTITION(pcol="2009-08-07") SELECT * WHERE ts = "2009-08-07";
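When the partition values follow a known pattern, such as consecutive dates, a multi-insert statement like the one above can be generated instead of typed by hand. The following is only a minimal sketch in Python under stated assumptions: the names src, tgt, pcol and ts are taken from the example, and the helper build_multi_insert is hypothetical, not part of Hive. It only builds the statement text; submitting it to Hive is left out.

```python
from datetime import date, timedelta


def build_multi_insert(src, tgt, pcol, ts_col, start, days):
    """Build one Hive multi-insert statement with one INSERT clause per day.

    Hypothetical helper: it produces the statement text only and does not
    talk to Hive. Partition values are ISO dates (YYYY-MM-DD).
    """
    # A single FROM clause is shared by every INSERT that follows it.
    lines = [f"FROM {src}"]
    for offset in range(days):
        day = (start + timedelta(days=offset)).isoformat()
        lines.append(
            f'INSERT OVERWRITE TABLE {tgt} PARTITION({pcol}="{day}") '
            f'SELECT * WHERE {ts_col} = "{day}"'
        )
    # Terminate the whole statement once, after the last INSERT clause.
    return "\n".join(lines) + ";"


if __name__ == "__main__":
    # Seven consecutive days starting 2009-08-01, as in the example.
    print(build_multi_insert("src", "tgt", "pcol", "ts", date(2009, 8, 1), 7))
```

This still requires the partition values to be known before the statement is written, so it automates the workaround rather than providing automatic multi-partition insertion.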
There is some ongoing work for integrating HBase tables with Hive: https://issues.apache.org/jira/browse/HIVE-705
We won't know which storage backend is the best until we have them
done and tested, but at the least HBase looks very promising for
datasets that fit in memory.
Here are the slides, which contain examples of how to add a new storage
backend (file format) to Hive: http://www.slideshare.net/ragho/hive-user-meeting-august-2009-facebook
Hive is completely open and we hope Hive can have more storage
backends, because it's not likely that one storage backend will be the
best for all kinds of applications.
On Wed, Aug 5, 2009 at 12:06 AM, Schubert Zhang wrote:
In the Hive paper "Hive - A Warehousing Solution Over a MapReduce
Framework", section 5 describes the FUTURE WORK of Hive. I would like to get
more detail on the following two points:
(1) Hive currently has a naive rule-based optimizer with a small number of
simple rules. We plan to build a cost-based optimizer and adaptive
optimization techniques to come up with more efficient plans.
Q: Is the ongoing work on "Indexing" one of these improvements?
Q: Are there any others?
(2) We are exploring columnar storage and more intelligent data placement to
improve scan performance.
Q: We found that Hive currently cannot place data in different partitions
intelligently (we must specify the partition value in each statement). Is
intelligent/dynamic placement into partitions one of these improvements? For
example, we have many input files which contain many records with different
timestamps, and we want to place each record into the proper partition
according to its timestamp column.
Q: Do you think Bigtable/HBase is a good columnar store that provides
a good model of intelligent data placement?