Grokbase Groups Hive user August 2009
In the Hive paper <Hive - A Warehousing Solution Over a MapReduce
Framework>, Section 5 describes the future work of Hive. I want to get
more detail on the following two points:

(1) Hive currently has a naive rule-based optimizer with a small number of
simple rules. We plan to build a cost-based optimizer and adaptive
optimization techniques to come up with more efficient plans.
Q: Is the ongoing work on "Indexing" one of these improvements?
Q: Are there any other improvements?
(2) We are exploring columnar storage and more intelligent data placement to
improve scan performance.
Q: We found that the current Hive cannot place data into different partitions
intelligently (we must specify the partition value in statements). Is
intelligent/dynamic placement of partitions one of these improvements? For
example, we have many input files containing records with different
timestamps, and we want to place each record into the proper partition
according to the timestamp column.
Q: Do you think Bigtable/HBase is a good columnar storage which provides a
good model of intelligent data placement?

Schubert


  • Zheng Shao at Aug 5, 2009 at 7:27 am
    1) We have not started working on a cost-based optimizer yet. Indexing is
    one of the ongoing efforts on the performance side. We are working on a
    couple more, e.g. a more compact on-disk format (LazyBinarySerDe,
    https://issues.apache.org/jira/browse/HIVE-640) which gives a nice
    speed-up for queries with multiple map-reduce jobs.

    2) We don't have a short-term plan for automatic multi-partition
    insertion. However, there is a simple workaround if you know the
    partition values (and Hive can do multiple inserts in a single
    map-reduce job!). "src" can be a subquery as well.
    FROM src
    INSERT OVERWRITE TABLE tgt PARTITION(pcol="2009-08-01") SELECT * WHERE ts = "2009-08-01"
    INSERT OVERWRITE TABLE tgt PARTITION(pcol="2009-08-02") SELECT * WHERE ts = "2009-08-02"
    INSERT OVERWRITE TABLE tgt PARTITION(pcol="2009-08-03") SELECT * WHERE ts = "2009-08-03"
    INSERT OVERWRITE TABLE tgt PARTITION(pcol="2009-08-04") SELECT * WHERE ts = "2009-08-04"
    INSERT OVERWRITE TABLE tgt PARTITION(pcol="2009-08-05") SELECT * WHERE ts = "2009-08-05"
    INSERT OVERWRITE TABLE tgt PARTITION(pcol="2009-08-06") SELECT * WHERE ts = "2009-08-06"
    INSERT OVERWRITE TABLE tgt PARTITION(pcol="2009-08-07") SELECT * WHERE ts = "2009-08-07";
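    Since this pattern is purely mechanical, the statement can be generated rather than written by hand. A minimal sketch in Python (the `multi_insert` helper is hypothetical, not part of Hive):

    ```python
    from datetime import date, timedelta

    def multi_insert(src, tgt, pcol, ts_col, start, days):
        """Build a Hive multi-insert statement covering `days` consecutive dates."""
        lines = [f"FROM {src}"]
        for i in range(days):
            d = (start + timedelta(days=i)).isoformat()
            lines.append(
                f'INSERT OVERWRITE TABLE {tgt} PARTITION({pcol}="{d}") '
                f'SELECT * WHERE {ts_col} = "{d}"'
            )
        return "\n".join(lines) + ";"

    # Reproduce the week-long statement above
    print(multi_insert("src", "tgt", "pcol", "ts", date(2009, 8, 1), 7))
    ```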

    There is some ongoing work on integrating HBase tables with Hive:
    https://issues.apache.org/jira/browse/HIVE-705
    We won't know which storage backend is best until we have them
    done and tested, but at the least HBase looks very promising for
    datasets that fit in memory.

    Here are the slides containing examples of how to add a new storage
    backend (file format) to Hive:
    http://www.slideshare.net/ragho/hive-user-meeting-august-2009-facebook
    Hive is completely open and we hope Hive can have more storage
    backends, because it is not likely that one storage backend will be the
    best for all kinds of applications.

    Zheng



    --
    Yours,
    Zheng
  • Schubert Zhang at Aug 5, 2009 at 6:43 pm
    Thanks Zheng.

    Regarding automatic multi-partition insertion, is it the planned feature
    "Inserts without listing partitions"?
    In our applications, we really want this feature, since our data comes
    into the data warehouse continually and we cannot know the partition
    before reading each row.

    Regarding Hive backed by HBase, I think it can also store persistent
    data in HBase, with the following advantages:
    1. The placement of each row is handled by HBase.
    2. The stored rows are sorted and indexed by HBase, and the index is a
    global table index.
    3. The data in HBase gains a SQL query interface via Hive.
    Schubert
  • Ashish Thusoo at Aug 5, 2009 at 7:08 pm
    Do you also need to be able to append the new data to an existing partition?

    Ashish

    ________________________________
    From: Schubert Zhang
    Sent: Wednesday, August 05, 2009 11:43 AM
    To: hive-user@hadoop.apache.org
    Subject: Re: Questions for the future work of Hive

  • Schubert Zhang at Aug 6, 2009 at 4:01 am
    Ashish,

    Yes, we need to append new data to an existing partition.

    I think the approach in Zheng Shao's reply of placing different rows into
    different partitions is inefficient, since we must run many SELECT ...
    WHERE ... map-reduce jobs. And in many cases, we cannot list the
    partitions present in the source dataset.

    In my project, we implemented a map-reduce job to achieve this, but it is
    very specific (we have not found a good way to generalize it). Here is
    what we did:
    (1) Sort rows by key = PartitionColumn + TheKeyColumnToSort.
    (2) Detect the partition changes in MyOutputFormat and write to
    different files in different partitions.
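    For illustration, the writer side of the two steps above can be sketched outside of MapReduce: assuming the rows arrive pre-sorted by (partition, key), the writer rolls to a new output file whenever the partition value changes. All names here (`write_partitioned`, the `pcol=` directory layout) are illustrative, not Hive or Hadoop APIs:

    ```python
    import os

    def write_partitioned(rows, out_dir):
        """rows: iterable of (partition, key, value), pre-sorted by (partition, key).
        Writes one file per partition, switching files at each partition boundary,
        mimicking what the custom OutputFormat described above would do."""
        current, fh = None, None
        for partition, key, value in rows:
            if partition != current:  # partition boundary: roll to a new output file
                if fh:
                    fh.close()
                pdir = os.path.join(out_dir, f"pcol={partition}")
                os.makedirs(pdir, exist_ok=True)
                fh = open(os.path.join(pdir, "part-00000"), "w")
                current = partition
            fh.write(f"{key}\t{value}\n")
        if fh:
            fh.close()
    ```

    Sorting first means each partition's file is opened exactly once, which is what makes the single-pass write possible.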

    Schubert

  • Andraz Tori at Aug 10, 2009 at 8:12 am

    2) We don't have a short-term plan for automatic-multi-partition
    insertion. However there is a simple workaround if you know the
    partition values (and Hive can do multiple inserts in a single
    map-reduce job!). "src" can be a sub query as well.
    FROM src
    INSERT OVERWRITE TABLE tgt PARTITION(pcol="2009-08-01") SELECT * WHERE
    ts = "2009-08-01"
    INSERT OVERWRITE TABLE tgt PARTITION(pcol="2009-08-02") SELECT * WHERE
    ts = "2009-08-02"
    --------------------------------------------------------

    In my case src too is partitioned by "ts", which means the two mappings should take place at the same time since the data is independent, but Hive (0.3) produces a linear partition-by-partition job sequence.
    I also do a group by inside every insert...


    Any ideas?

    [This, together with the fact that hive --service thriftserver (at least in 0.3) doesn't support multiple clients, makes it very hard to run some queries effectively.]




    --
    Andraz Tori, CTO
    Zemanta Ltd, New York, London, Ljubljana
    www.zemanta.com
    mail: andraz@zemanta.com
    tel: +386 41 515 767
    twitter: andraz, skype: minmax_test
  • Ashish Thusoo at Aug 10, 2009 at 7:35 pm
    Hive trunk has support for multi-group-by, which performs better than what 0.3.0 does.

    I did not completely understand your comment that "the two mappings should take place at the same time".

    Can you elaborate?

    Ashish


Discussion Overview
group: user
categories: hive, hadoop
posted: Aug 5, '09 at 7:06a
active: Aug 10, '09 at 7:35p
posts: 7
users: 4
website: hive.apache.org
