Grokbase Groups: Hive user, July 2009

Jason Michael at Jul 14, 2009 at 4:44 am
In our Hive instance, we have one large fact-type table that joins to several dimension tables on integer keys. I know from reading the Language Manual that, when ordering joins, it is best to place the largest table last in the sequence to minimize memory usage. This won't work when you want to join the large fact table to more than one dimension. Something like:

select ... from small_table1 join big_table on ... join small_table2 on ...

I have to imagine this is a pretty common pattern; is there any guidance for doing this sort of star schema join?

  • Namit Jain at Jul 14, 2009 at 5:45 am
    The large table is only a problem if the number of values for a given key is very large, since those values are buffered in memory.

    If your dimension tables are small, you can use map-join. That way, no reduce phase is needed. You need to specify the hint in the select clause
    along with the list of tables which are small.


    For example:


    select /*+ MAPJOIN(small_table1, small_table2) */ ... from small_table1 join big_table on ... join small_table2 on ...


    Thanks,
    -namit


  • He Yongqiang at Jul 14, 2009 at 12:51 pm
    Map-side join is the most efficient, but as Namit mentioned, it also has some
    limitations.
    If your dimension tables are small, you can use map-join.
    You can also use a temporary table for map-join if a dimension table is not small but
    predicates on that table can dramatically reduce the number of rows, as in the sketch below.
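
    A minimal sketch of that idea; big_dim, small_dim_tmp, their columns, and the
    predicate are all hypothetical names, not from this thread:

    -- populate a small temporary table by filtering the large dimension
    -- (small_dim_tmp is assumed to already exist with the matching schema)
    insert overwrite table small_dim_tmp
    select d_key, d_value from big_dim where d_value = 'something selective';

    -- then map-join the now-small table against the fact table
    select /*+ MAPJOIN(small_dim_tmp) */ ...
    from small_dim_tmp join big_table on big_table.d_key = small_dim_tmp.d_key;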

    Another optimization you can experiment with is the so-called "index push-down join".
    Since Hive does not support indexes (and of course no bitmap/bit-vector indexes) right
    now, you can mimic the technique yourself. The key is to precompute
    several join tables, and to find a way to first join the small tables and then
    join the result with the fact table.

    create table dimension1_fact(d1_key, fact_key, attributes of dimension table
    d1 which are mostly used in queries);
    create table dimension2_fact(d2_key, fact_key, attributes of dimension table
    d2 which are mostly used in queries);
    etc ...
    And you can translate your queries to use these manually created tables.
    The query looks like:
    select .... from big_table join (select fact_key, ... from dimension1_fact
    join dimension2_fact on dimension1_fact.fact_key = dimension2_fact.fact_key
    where .....) join_dimension_table on
    big_table.key = join_dimension_table.fact_key
    I am not sure this will reduce the execution time; it may even increase it
    (you can do some experiments :) ).

  • David Lerman at Jul 14, 2009 at 3:12 pm
    Thanks. Namit, map-join looks promising. He, I'm not quite sure I'm
    following yet.

    In our (simplified) case, we have a fact table with many millions of rows
    (call it f), which joins to two dimension tables (call them d1 and d2), each
    of which has at most a few thousand rows.

    So:
    create table f(f_key int, d1_key int, d2_key int, measure int);
    create table d1(d1_key int, d1_value string);
    create table d2(d2_key int, d2_value string);

    The facts are split among the dimensions extremely disproportionately, so
    although d2 has 1000 rows, 90% of the facts are linked to the top 10.

    The query is:
    select d1.d1_value, d2.d2_value, sum(measure) from d1 join f on d1.d1_key =
    f.d1_key join d2 on d2.d2_key = f.d2_key group by d1.d1_value, d2.d2_value;

    We're finding that though we have 100 reducers, 98 of them are executing
    very quickly while 2 get stuck with the vast majority of the reduce records
    and crash with out-of-memory exceptions even though they have multiple gigs
    of memory allocated.

    This raises two questions:

    1. With this kind of join (small x big x small), what's the best way to
    keep the memory usage down? Map-join definitely looks promising, and He,
    I'd like to understand the index push down join approach so we can try it
    out as well.

    2. I'm assuming we're getting those 2 reducers stuck with the whole load
    because the reduce records get partitioned by the join key and the vast
    majority of records use just a few keys. Is there anything we can do about
    this?

    Thanks so much!
    Dave
  • Namit Jain at Jul 14, 2009 at 3:22 pm
    You are doing two joins which are not getting merged into a single map-reduce job because the join keys are different.


    1. Map join is the best way to get around this; both dimension tables will be held in memory. This should work for you (see the sketch after the rewritten queries below).
    2. To make sure that the fact table is the outermost table, break up the query:

    select d1.d1_value, d2.d2_value, sum(measure) from d1 join f on d1.d1_key =
    f.d1_key join d2 on d2.d2_key = f.d2_key group by d1.d1_value, d2.d2_value;

    Change it to (this assumes tmp1 has already been created with the joined schema):

    Insert overwrite table tmp1
    select * from d1 join f on d1.d1_key = f.d1_key;

    select tmp1.d1_value, d2.d2_value, sum(measure) from
    d2 join tmp1 on d2.d2_key = tmp1.d2_key group by tmp1.d1_value, d2.d2_value;
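
    For option 1, a minimal sketch of the map-join version of the original query,
    using the MAPJOIN hint shown earlier (not tested against this schema):

    select /*+ MAPJOIN(d1, d2) */ d1.d1_value, d2.d2_value, sum(f.measure)
    from d1 join f on d1.d1_key = f.d1_key
    join d2 on d2.d2_key = f.d2_key
    group by d1.d1_value, d2.d2_value;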


    -namit


  • He Yongqiang at Jul 14, 2009 at 4:02 pm
    If the dimension tables only have a few thousand rows, then map-side join will
    be best.

    What I was describing is not a real index push-down join; you can refer to
    http://portal.acm.org/citation.cfm?id=564691.564754 for more details about
    index push-down joins.
    My previous description of the index push-down join is not suitable for your
    situation, because:
    1) the dimension tables only have a few thousand rows;
    2) your fact/dimension tables are not fat (wide) at all;
    3) there is no predicate on the dimension tables.

    And an example to explain my previous description:
    create table f(f_key int, d1_key int, d2_key int, measure int);
    create table d1(d1_key int, d1_value string);
    create table d2(d2_key int, d2_value string);

    Create several precomputed join tables (these tables can be shared among many
    related queries):
    create table f_d1(f_key int, d1_key int, d1_attr...);
    create table f_d2(f_key int, d2_key int, d2_attr...);

    First join f_d1 with f_d2, then join the result with the big fat fact table
    (the big table comes last and is only scanned once), as in the sketch below.
    This assumes f_d1 and f_d2 are much smaller than the fact table and the
    dimension tables, which does not hold in your situation; there they would
    become much bigger :( .
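
    A minimal sketch of that final query, under the assumptions above (the where
    clause, selected columns, and the joined_dims alias are illustrative):

    select ...
    from f join (
        select f_d1.f_key, f_d1.d1_attr, f_d2.d2_attr
        from f_d1 join f_d2 on f_d1.f_key = f_d2.f_key
        where ...
    ) joined_dims
    on f.f_key = joined_dims.f_key;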
