Combine() optimization
Is there any way/plan for Hive to take advantage of M/R's combine()
phase? Rules could be embedded in the query optimizer, or hints could be
passed by the user...
GROUP BY should benefit from this a lot.

Any comment?


  • Raghu Murthy at Feb 26, 2009 at 2:03 pm
Right now Hive does not exploit the combiner. But hash-based map-side
aggregation in Hive (controlled by hints) provides a similar optimization.
Using the combiner in addition to map-side aggregation should improve
performance even more if the combiner can further aggregate the partial
aggregates generated by the mapper.
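The two techniques can be sketched in a few lines, as a toy Python illustration rather than Hive's actual code: the mapper's hash table collapses per-record pairs into per-key partials, and a combiner can then merge partials (e.g. across multiple flushes from the same mapper).

```python
from collections import defaultdict

def map_side_aggregate(records):
    """Hash-based partial aggregation inside the mapper: emit one
    (key, partial_count) pair per distinct key instead of one pair
    per input record."""
    partials = defaultdict(int)
    for key in records:
        partials[key] += 1
    return list(partials.items())

def combine(partial_pairs):
    """A combiner can further aggregate partials, merging duplicate
    keys before the data crosses the network to the reducer."""
    merged = defaultdict(int)
    for key, count in partial_pairs:
        merged[key] += count
    return dict(merged)

# One mapper's output after hash aggregation:
pairs = map_side_aggregate(["a", "b", "a", "a", "b", "c"])
# The combiner merges any duplicate keys among the partials:
result = combine(pairs + [("a", 2)])
```

With six input records and three distinct keys, the mapper emits three pairs instead of six; the combiner then folds the extra `("a", 2)` partial into the total for `"a"`.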


  • Zheng Shao at Feb 26, 2009 at 9:51 pm
    Hi Qing,

We did think about the Combiner when we started Hive. However, earlier
discussions led us to believe that hash-based aggregation inside the mapper
would be as competitive as using a combiner in most cases.

To enable map-side aggregation, just run the following before the Hive
query:
set hive.map.aggr=true;

    Zheng


    --
    Yours,
    Zheng
  • Qing Yan at Feb 27, 2009 at 1:58 am
    Got it.

Does map-side aggregation have any special requirements on the dataset?
E.g., the number of unique GROUP BY keys could be too big to hold
in memory. Will it still work?
  • Namit Jain at Feb 27, 2009 at 2:04 am
Yes, it flushes the data when the hash table occupies too much memory.
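This flush-on-overflow behavior can be sketched as a toy Python illustration (hypothetical entry-count threshold; Hive's actual trigger is memory usage, not entry count):

```python
from collections import defaultdict

def hash_aggregate_with_flush(keys, max_entries=3):
    """Toy mapper-side aggregator: when the hash table grows past
    max_entries, emit its contents as partial aggregates and keep
    going. The reducer later merges partials for the same key, so
    flushing early costs bandwidth but never correctness."""
    table = defaultdict(int)
    emitted = []
    for key in keys:
        table[key] += 1
        if len(table) > max_entries:
            emitted.extend(table.items())  # flush partials downstream
            table.clear()
    emitted.extend(table.items())  # final flush when the mapper closes
    return emitted
```

Because flushed keys may reappear later in the input, the same key can show up in several partial pairs; the reducer sums them exactly as it would sum per-record pairs.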


  • Qing Yan at Feb 27, 2009 at 8:13 am
    Ouch, I was getting tons of exceptions after turning on map-side
    aggregation:

java.lang.OutOfMemoryError: Java heap space
at java.lang.StringCoding$StringEncoder.encode(StringCoding.java:232)
at java.lang.StringCoding.encode(StringCoding.java:272)
at java.lang.String.getBytes(String.java:947)
at org.apache.hadoop.hive.serde2.thrift.TBinarySortableProtocol.writeString(TBinarySortableProtocol.java:299)
at org.apache.hadoop.hive.serde2.dynamic_type.DynamicSerDeTypeString.serialize(DynamicSerDeTypeString.java:65)
at org.apache.hadoop.hive.serde2.dynamic_type.DynamicSerDeFieldList.serialize(DynamicSerDeFieldList.java:249)
at org.apache.hadoop.hive.serde2.dynamic_type.DynamicSerDeStructBase.serialize(DynamicSerDeStructBase.java:81)
at org.apache.hadoop.hive.serde2.dynamic_type.DynamicSerDe.serialize(DynamicSerDe.java:174)
at org.apache.hadoop.hive.ql.exec.ReduceSinkOperator.process(ReduceSinkOperator.java:153)
at org.apache.hadoop.hive.ql.exec.Operator.forward(Operator.java:306)
at org.apache.hadoop.hive.ql.exec.GroupByOperator.forward(GroupByOperator.java:564)
at org.apache.hadoop.hive.ql.exec.GroupByOperator.close(GroupByOperator.java:582)
at org.apache.hadoop.hive.ql.exec.Operator.close(Operator.java:263)
at org.apache.hadoop.hive.ql.exec.Operator.close(Operator.java:263)
at org.apache.hadoop.hive.ql.exec.Operator.close(Operator.java:263)
at org.apache.hadoop.hive.ql.exec.ExecMapper.close(ExecMapper.java:96)
at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:57)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:332)
at org.apache.hadoop.mapred.Child.main(Child.java:155)

java.lang.OutOfMemoryError: GC overhead limit exceeded
at org.apache.hadoop.hive.ql.exec.GroupByOperator.forward(GroupByOperator.java:552)
at org.apache.hadoop.hive.ql.exec.GroupByOperator.close(GroupByOperator.java:582)
at org.apache.hadoop.hive.ql.exec.Operator.close(Operator.java:263)
at org.apache.hadoop.hive.ql.exec.Operator.close(Operator.java:263)
at org.apache.hadoop.hive.ql.exec.Operator.close(Operator.java:263)
at org.apache.hadoop.hive.ql.exec.ExecMapper.close(ExecMapper.java:96)
at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:57)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:332)
at org.apache.hadoop.mapred.Child.main(Child.java:155)

    java.io.IOException: Task process exit with nonzero status of 1.
    ...
Just to confirm: is this a bug, or by design?

  • Namit Jain at Feb 27, 2009 at 3:57 pm
Look at the patch for http://issues.apache.org/jira/browse/HIVE-223.
It has not been committed yet.

Thanks,
-namit

  • Scott Carey at Feb 27, 2009 at 9:41 pm
Does it dump all contents and start over, or use an LRU or MFU algorithm?
LinkedHashMap makes LRUs and similar constructs fairly easy to build.
My guess is that most data types have biased value distributions that will
take advantage of map-side partial aggregation fairly well.
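The access-ordered LinkedHashMap idea maps naturally onto Python's OrderedDict; a hypothetical LRU-flushing aggregation table might look like this (an illustration of the suggestion, not Hive's actual policy):

```python
from collections import OrderedDict

class LRUAggregator:
    """Sketch of an LRU eviction policy for a map-side aggregation
    table, analogous to Java's access-ordered LinkedHashMap: on
    overflow, the least recently updated key is flushed first as a
    partial aggregate, keeping hot (frequent) keys resident."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.table = OrderedDict()
        self.flushed = []  # partial aggregates emitted downstream

    def add(self, key, value=1):
        self.table[key] = self.table.get(key, 0) + value
        self.table.move_to_end(key)  # mark key as most recently used
        if len(self.table) > self.capacity:
            # evict the least recently used entry
            self.flushed.append(self.table.popitem(last=False))
```

With a biased key distribution, the hot keys keep being touched and stay in the table, so most aggregation happens map-side and only the long tail is flushed early.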


  • Namit Jain at Feb 27, 2009 at 9:59 pm
Today it dumps a random 10% of the hash table.
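A random-fraction flush like the one described can be sketched as follows (toy Python; the 10% fraction matches the behavior described, everything else is illustrative):

```python
import random

def flush_random_fraction(table, fraction=0.1, rng=None):
    """Evict a random fraction (default 10%) of the hash table's
    entries, returning them as partial aggregates for the reducer
    to merge. A seeded RNG is used here only to make the sketch
    deterministic."""
    rng = rng or random.Random(0)
    n = max(1, int(len(table) * fraction))
    victims = rng.sample(sorted(table), n)  # pick n keys at random
    return [(k, table.pop(k)) for k in victims]
```

Random eviction is simple and cheap, but unlike LRU/MFU it may evict a hot key that is about to be updated again, which is the trade-off the thread is discussing.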


  • Joydeep Sen Sarma at Feb 27, 2009 at 10:23 pm
Yeah, we definitely want to convert it to an MFU-type flush algorithm.

If someone wants to take a crack at it before we can get to it, that would be awesome.

Discussion Overview
group: user
categories: hive, hadoop
posted: Feb 26, '09 at 1:57p
active: Feb 27, '09 at 10:23p
posts: 10
users: 6
website: hive.apache.org
