FAQ
I am facing issues with a query where I am joining two fairly large tables
on the partitioned column along with other common columns. The expected
output is not in line with what I expect it to be. Since the query is very
complex, I will simplify it so that people can provide inputs if they have
faced similar issues or if I am doing something totally wrong.

TABLE A:
a_id bigint
common_id bigint
some_string string
total_count bigint
part_col string <---- this is the partitioned column

TABLE B:
b_int bigint
common_id bigint
some_string string
total_sum bigint
part_col string <---- this is the partitioned column

now the query is as follows:
SELECT /*+ STREAMTABLE(A,B) */ A.some_string, B.some_string,
sum(A.total_count), sum(B.total_sum) from A JOIN B ON (t1.part_col =
t2.part_col AND t1.common_id = t2.common_id) WHERE t1.part_col >= 'val1' AND
t2.part_col >= 'val1' GROUP BY A.some_string, B.some_string

Does HIVE not like to join on the partitioned columns ? because when i
create a join on just the partitioned column the reduce step never finishes.

I am using HIVE 0.5.0

Thanks,
Viral

Search Discussions

  • Viral Bajaria at Jan 19, 2011 at 4:08 am
    I haven't heard back from any on the list and am still struggling to join
    two tables on partitioned column

    Has anyone every tried joining two tables on a paritioned column and the
    results are not as expected ?
    On Tue, Jan 18, 2011 at 2:04 AM, Viral Bajaria wrote:

    I am facing issues with a query where I am joining two fairly large tables
    on the partitioned column along with other common columns. The expected
    output is not in line with what I expect it to be. Since the query is very
    complex, I will simplify it so that people can provide inputs if they have
    faced similar issues or if I am doing something totally wrong.

    TABLE A:
    a_id bigint
    common_id bigint
    some_string string
    total_count bigint
    part_col string <---- this is the partitioned column

    TABLE B:
    b_int bigint
    common_id bigint
    some_string string
    total_sum bigint
    part_col string <---- this is the partitioned column

    now the query is as follows:
    SELECT /*+ STREAMTABLE(A,B) */ A.some_string, B.some_string,
    sum(A.total_count), sum(B.total_sum) from A JOIN B ON (t1.part_col =
    t2.part_col AND t1.common_id = t2.common_id) WHERE t1.part_col >= 'val1' AND
    t2.part_col >= 'val1' GROUP BY A.some_string, B.some_string

    Does HIVE not like to join on the partitioned columns ? because when i
    create a join on just the partitioned column the reduce step never finishes.

    I am using HIVE 0.5.0

    Thanks,
    Viral
  • Ajo Fod at Jan 19, 2011 at 4:40 am
    Can you try this with a dummy table with very few rows ... to see if
    the reason the script doesn't finish is a computational issue?

    One other thing is to try with a combined partition, to see if it is a
    problem with the partitioning.

    Also, take a look at the results of an EXPLAIN statement, see if
    there are any hints there.

    NOTE: I'm new to hive too.

    -Ajo

    On Tue, Jan 18, 2011 at 8:08 PM, Viral Bajaria wrote:
    I haven't heard back from any on the list and am still struggling to join
    two tables on partitioned column

    Has anyone every tried joining two tables on a paritioned column and the
    results are not as expected ?
    On Tue, Jan 18, 2011 at 2:04 AM, Viral Bajaria wrote:

    I am facing issues with a query where I am joining two fairly large tables
    on the partitioned column along with other common columns. The expected
    output is not in line with what I expect it to be. Since the query is very
    complex, I will simplify it so that people can provide inputs if they have
    faced similar issues or if I am doing something totally wrong.
    TABLE A:
    a_id bigint
    common_id bigint
    some_string string
    total_count bigint
    part_col string <---- this is the partitioned column
    TABLE B:
    b_int bigint
    common_id bigint
    some_string string
    total_sum bigint
    part_col string <---- this is the partitioned column
    now the query is as follows:
    SELECT /*+ STREAMTABLE(A,B) */ A.some_string, B.some_string,
    sum(A.total_count), sum(B.total_sum) from A JOIN B ON (t1.part_col =
    t2.part_col AND t1.common_id = t2.common_id) WHERE t1.part_col >= 'val1' AND
    t2.part_col >= 'val1' GROUP BY A.some_string, B.some_string
    Does HIVE not like to join on the partitioned columns ? because when i
    create a join on just the partitioned column the reduce step never finishes.
    I am using HIVE 0.5.0
    Thanks,
    Viral
  • Appan Thirumaligai at Jan 19, 2011 at 6:37 pm
    Viral,

    I tried the queries below (similar to yours) and I get the expected results when I do the join. I ran my queries after building hive from the latest source and hadoop 0.20+.

    create table table_a(a_id bigint, common_id bigint, some_string string,total_count bigint) partitioned by (part_col string) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' LINES TERMINATED BY '\n' STORED AS TEXTFILE;
    create table table_b(b_id bigint, common_id bigint, some_string string,total_count bigint) partitioned by (part_col string) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' LINES TERMINATED BY '\n' STORED AS TEXTFILE;
    dfs -mkdir /user/data/table_a;
    dfs -mkdir /user/data/table_b;
    dfs -put /home/training/hiveug/table_a.csv /user/data/table_a;
    dfs -put /home/training/hiveug/table_b.csv /user/data/table_b;
    alter table table_a add partition (part_col = 'mypart') location '/user/data/table_a';
    alter table table_b add partition (part_col = 'mypart') location '/user/data/table_b';
    select * from table_a t1 join table_b t2 on t1.part_col == t2.part_col;
    -->> Returns expected result
    select t1.some_string,t2.some_string,sum(t1.total_count),sum(t2.total_count) from table_a t1 join table_b t2 on t1.part_col = t2.part_col where t1.part_col >= 'mypart' and t2.part_col >= 'mypart' group by t1.some_string,t2.some_string;
    --->>Works fine.
    select t1.some_string,t2.some_string,sum(t1.total_count),sum(t2.total_count) from table_a t1 join table_b t2 on t1.part_col = t2.part_col where t1.part_col >= 'mypart' and t2.part_col >= 'mypart' group by t1.some_string,t2.some_st* from table_a t1 join table_b t2 on t1.part_col = t2.part_col where t1.part_col >= 'mypart' and t2.part_col >= 'mypart';
    --->Works fine.

    I created the two files with sample data in them and copied it to hdfs

    I'll try later on your hive 0.5.0 but looks like there might be something wrong in your query.
    On Jan 18, 2011, at 8:40 PM, Ajo Fod wrote:

    Can you try this with a dummy table with very few rows ... to see if
    the reason the script doesn't finish is a computational issue?

    One other thing is to try with a combined partition, to see if it is a
    problem with the partitioning.

    Also, take a look at the results of an EXPLAIN statement, see if
    there are any hints there.

    NOTE: I'm new to hive too.

    -Ajo

    On Tue, Jan 18, 2011 at 8:08 PM, Viral Bajaria wrote:
    I haven't heard back from any on the list and am still struggling to join
    two tables on partitioned column

    Has anyone every tried joining two tables on a paritioned column and the
    results are not as expected ?
    On Tue, Jan 18, 2011 at 2:04 AM, Viral Bajaria <viral.bajaria@gmail.com>
    wrote:
    I am facing issues with a query where I am joining two fairly large tables
    on the partitioned column along with other common columns. The expected
    output is not in line with what I expect it to be. Since the query is very
    complex, I will simplify it so that people can provide inputs if they have
    faced similar issues or if I am doing something totally wrong.
    TABLE A:
    a_id bigint
    common_id bigint
    some_string string
    total_count bigint
    part_col string <---- this is the partitioned column
    TABLE B:
    b_int bigint
    common_id bigint
    some_string string
    total_sum bigint
    part_col string <---- this is the partitioned column
    now the query is as follows:
    SELECT /*+ STREAMTABLE(A,B) */ A.some_string, B.some_string,
    sum(A.total_count), sum(B.total_sum) from A JOIN B ON (t1.part_col =
    t2.part_col AND t1.common_id = t2.common_id) WHERE t1.part_col >= 'val1' AND
    t2.part_col >= 'val1' GROUP BY A.some_string, B.some_string
    Does HIVE not like to join on the partitioned columns ? because when i
    create a join on just the partitioned column the reduce step never finishes.
    I am using HIVE 0.5.0
    Thanks,
    Viral
    Appan Thirumaligai
    appan@ngmoco.com
    Ph:1-818-472-8427
    ngmoco:)
  • Viral Bajaria at Jan 19, 2011 at 6:47 pm
    Thanks Appan for verifying. I will do some more tests on my side too and let
    you know the results.

    I tried a different version of the query where I join'ed two sub-queries for
    the same partitions and the data comes out to be correct.

    I will see if I can post the real-world example to the list, because that
    might sound like a more practical example.

    If you still have your example(s) do you mind sending me your query-plan for


    select t1.some_string,t2.some_string,sum(t1.total_count),sum(t2.total_count)
    from table_a t1 join table_b t2 on t1.part_col = t2.part_col and
    t1.common_id = t2.common_id where t1.part_col >= 'mypart' and t2.part_col >=
    'mypart' group by t1.some_string,t2.some_string;

    -Viral
    On Wed, Jan 19, 2011 at 10:36 AM, Appan Thirumaligai wrote:

    Viral,

    I tried the queries below (similar to yours) and I get the expected results
    when I do the join. I ran my queries after building hive from the latest
    source and hadoop 0.20+.
    create table table_a(a_id bigint, common_id bigint, some_string
    string,total_count bigint) partitioned by (part_col string) ROW FORMAT
    DELIMITED FIELDS TERMINATED BY ',' LINES TERMINATED BY '\n' STORED AS
    TEXTFILE;
    create table table_b(b_id bigint, common_id bigint, some_string
    string,total_count bigint) partitioned by (part_col string) ROW FORMAT
    DELIMITED FIELDS TERMINATED BY ',' LINES TERMINATED BY '\n' STORED AS
    TEXTFILE;
    dfs -mkdir /user/data/table_a;
    dfs -mkdir /user/data/table_b;
    dfs -put /home/training/hiveug/table_a.csv /user/data/table_a;
    dfs -put /home/training/hiveug/table_b.csv /user/data/table_b;
    alter table table_a add partition (part_col = 'mypart') location
    '/user/data/table_a';
    alter table table_b add partition (part_col = 'mypart') location
    '/user/data/table_b';
    select * from table_a t1 join table_b t2 on t1.part_col == t2.part_col;
    -->> Returns expected result
    select
    t1.some_string,t2.some_string,sum(t1.total_count),sum(t2.total_count) from
    table_a t1 join table_b t2 on t1.part_col = t2.part_col where t1.part_col >=
    'mypart' and t2.part_col >= 'mypart' group by t1.some_string,t2.some_string;
    --->>Works fine.
    select
    t1.some_string,t2.some_string,sum(t1.total_count),sum(t2.total_count) from
    table_a t1 join table_b t2 on t1.part_col = t2.part_col where t1.part_col >=
    'mypart' and t2.part_col >= 'mypart' group by t1.some_string,t2.some_st*
    from table_a t1 join table_b t2 on t1.part_col = t2.part_col where
    t1.part_col >= 'mypart' and t2.part_col >= 'mypart';
    --->Works fine.

    I created the two files with sample data in them and copied it to hdfs

    I'll try later on your hive 0.5.0 but looks like there might be something
    wrong in your query.

    On Jan 18, 2011, at 8:40 PM, Ajo Fod wrote:

    Can you try this with a dummy table with very few rows ... to see if
    the reason the script doesn't finish is a computational issue?

    One other thing is to try with a combined partition, to see if it is a
    problem with the partitioning.

    Also, take a look at the results of an EXPLAIN statement, see if
    there are any hints there.

    NOTE: I'm new to hive too.

    -Ajo


    On Tue, Jan 18, 2011 at 8:08 PM, Viral Bajaria wrote:

    I haven't heard back from any on the list and am still struggling to join

    two tables on partitioned column


    Has anyone every tried joining two tables on a paritioned column and the

    results are not as expected ?

    On Tue, Jan 18, 2011 at 2:04 AM, Viral Bajaria <viral.bajaria@gmail.com>

    wrote:


    I am facing issues with a query where I am joining two fairly large
    tables

    on the partitioned column along with other common columns. The expected

    output is not in line with what I expect it to be. Since the query is
    very

    complex, I will simplify it so that people can provide inputs if they
    have

    faced similar issues or if I am doing something totally wrong.

    TABLE A:

    a_id bigint

    common_id bigint

    some_string string

    total_count bigint

    part_col string <---- this is the partitioned column

    TABLE B:

    b_int bigint

    common_id bigint

    some_string string

    total_sum bigint

    part_col string <---- this is the partitioned column

    now the query is as follows:

    SELECT /*+ STREAMTABLE(A,B) */ A.some_string, B.some_string,

    sum(A.total_count), sum(B.total_sum) from A JOIN B ON (t1.part_col =

    t2.part_col AND t1.common_id = t2.common_id) WHERE t1.part_col >= 'val1'
    AND

    t2.part_col >= 'val1' GROUP BY A.some_string, B.some_string

    Does HIVE not like to join on the partitioned columns ? because when i

    create a join on just the partitioned column the reduce step never
    finishes.

    I am using HIVE 0.5.0

    Thanks,

    Viral



    Appan Thirumaligai
    appan@ngmoco.com
    Ph:1-818-472-8427
    ngmoco:)
  • Appan Thirumaligai at Jan 19, 2011 at 7:07 pm
    EXPLAIN select t1.some_string,t2.some_string,sum(t1.total_count),sum(t2.total_count) from table_a t1 join table_b t2 on t1.part_col = t2.part_col and t1.common_id = t2.common_id where t1.part_col >= 'mypart' and t2.part_col >= 'mypart' group by t1.some_string,t2.some_string;

    OK
    ABSTRACT SYNTAX TREE:
    (TOK_QUERY (TOK_FROM (TOK_JOIN (TOK_TABREF table_a t1) (TOK_TABREF table_b t2) (and (= (. (TOK_TABLE_OR_COL t1) part_col) (. (TOK_TABLE_OR_COL t2) part_col)) (= (. (TOK_TABLE_OR_COL t1) common_id) (. (TOK_TABLE_OR_COL t2) common_id))))) (TOK_INSERT (TOK_DESTINATION (TOK_DIR TOK_TMP_FILE)) (TOK_SELECT (TOK_SELEXPR (. (TOK_TABLE_OR_COL t1) some_string)) (TOK_SELEXPR (. (TOK_TABLE_OR_COL t2) some_string)) (TOK_SELEXPR (TOK_FUNCTION sum (. (TOK_TABLE_OR_COL t1) total_count))) (TOK_SELEXPR (TOK_FUNCTION sum (. (TOK_TABLE_OR_COL t2) total_count)))) (TOK_WHERE (and (>= (. (TOK_TABLE_OR_COL t1) part_col) 'mypart') (>= (. (TOK_TABLE_OR_COL t2) part_col) 'mypart'))) (TOK_GROUPBY (. (TOK_TABLE_OR_COL t1) some_string) (. (TOK_TABLE_OR_COL t2) some_string))))

    STAGE DEPENDENCIES:
    Stage-1 is a root stage
    Stage-2 depends on stages: Stage-1
    Stage-0 is a root stage

    STAGE PLANS:
    Stage: Stage-1
    Map Reduce
    Alias -> Map Operator Tree:
    t1
    TableScan
    alias: t1
    Filter Operator
    predicate:
    expr: (part_col >= 'mypart')
    type: boolean
    Reduce Output Operator
    key expressions:
    expr: part_col
    type: string
    expr: common_id
    type: bigint
    sort order: ++
    Map-reduce partition columns:
    expr: part_col
    type: string
    expr: common_id
    type: bigint
    tag: 0
    value expressions:
    expr: some_string
    type: string
    expr: total_count
    type: bigint
    expr: part_col
    type: string
    t2
    TableScan
    alias: t2
    Filter Operator
    predicate:
    expr: (part_col >= 'mypart')
    type: boolean
    Reduce Output Operator
    key expressions:
    expr: part_col
    type: string
    expr: common_id
    type: bigint
    sort order: ++
    Map-reduce partition columns:
    expr: part_col
    type: string
    expr: common_id
    type: bigint
    tag: 1
    value expressions:
    expr: some_string
    type: string
    expr: total_count
    type: bigint
    expr: part_col
    type: string
    Reduce Operator Tree:
    Join Operator
    condition map:
    Inner Join 0 to 1
    condition expressions:
    0 {VALUE._col2} {VALUE._col3} {VALUE._col4}
    1 {VALUE._col2} {VALUE._col3} {VALUE._col4}
    handleSkewJoin: false
    outputColumnNames: _col2, _col3, _col4, _col9, _col10, _col11
    Filter Operator
    predicate:
    expr: ((_col4 >= 'mypart') and (_col11 >= 'mypart'))
    type: boolean
    Select Operator
    expressions:
    expr: _col2
    type: string
    expr: _col9
    type: string
    expr: _col3
    type: bigint
    expr: _col10
    type: bigint
    outputColumnNames: _col2, _col9, _col3, _col10
    Group By Operator
    aggregations:
    expr: sum(_col3)
    expr: sum(_col10)
    bucketGroup: false
    keys:
    expr: _col2
    type: string
    expr: _col9
    type: string
    mode: hash
    outputColumnNames: _col0, _col1, _col2, _col3
    File Output Operator
    compressed: false
    GlobalTableId: 0
    table:
    input format: org.apache.hadoop.mapred.SequenceFileInputFormat
    output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat

    Stage: Stage-2
    Map Reduce
    Alias -> Map Operator Tree:
    hdfs://localhost:8022/tmp/hive-training/hive_2011-01-19_11-05-34_526_453408324928472657/-mr-10002
    Reduce Output Operator
    key expressions:
    expr: _col0
    type: string
    expr: _col1
    type: string
    sort order: ++
    Map-reduce partition columns:
    expr: _col0
    type: string
    expr: _col1
    type: string
    tag: -1
    value expressions:
    expr: _col2
    type: bigint
    expr: _col3
    type: bigint
    Reduce Operator Tree:
    Group By Operator
    aggregations:
    expr: sum(VALUE._col0)
    expr: sum(VALUE._col1)
    bucketGroup: false
    keys:
    expr: KEY._col0
    type: string
    expr: KEY._col1
    type: string
    mode: mergepartial
    outputColumnNames: _col0, _col1, _col2, _col3
    Select Operator
    expressions:
    expr: _col0
    type: string
    expr: _col1
    type: string
    expr: _col2
    type: bigint
    expr: _col3
    type: bigint
    outputColumnNames: _col0, _col1, _col2, _col3
    File Output Operator
    compressed: false
    GlobalTableId: 0
    table:
    input format: org.apache.hadoop.mapred.TextInputFormat
    output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat

    Stage: Stage-0
    Fetch Operator
    limit: -1
    On Jan 19, 2011, at 10:47 AM, Viral Bajaria wrote:

    Thanks Appan for verifying. I will do some more tests on my side too and let you know the results.

    I tried a different version of the query where I join'ed two sub-queries for the same partitions and the data comes out to be correct.

    I will see if I can post the real-world example to the list, because that might sound like a more practical example.

    If you still have your example(s) do you mind sending me your query-plan for

    select t1.some_string,t2.some_string,sum(t1.total_count),sum(t2.total_count) from table_a t1 join table_b t2 on t1.part_col = t2.part_col and t1.common_id = t2.common_id where t1.part_col >= 'mypart' and t2.part_col >= 'mypart' group by t1.some_string,t2.some_string;

    -Viral
    On Wed, Jan 19, 2011 at 10:36 AM, Appan Thirumaligai wrote:
    Viral,

    I tried the queries below (similar to yours) and I get the expected results when I do the join. I ran my queries after building hive from the latest source and hadoop 0.20+.
    create table table_a(a_id bigint, common_id bigint, some_string string,total_count bigint) partitioned by (part_col string) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' LINES TERMINATED BY '\n' STORED AS TEXTFILE;
    create table table_b(b_id bigint, common_id bigint, some_string string,total_count bigint) partitioned by (part_col string) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' LINES TERMINATED BY '\n' STORED AS TEXTFILE;
    dfs -mkdir /user/data/table_a;
    dfs -mkdir /user/data/table_b;
    dfs -put /home/training/hiveug/table_a.csv /user/data/table_a;
    dfs -put /home/training/hiveug/table_b.csv /user/data/table_b;
    alter table table_a add partition (part_col = 'mypart') location '/user/data/table_a';
    alter table table_b add partition (part_col = 'mypart') location '/user/data/table_b';
    select * from table_a t1 join table_b t2 on t1.part_col == t2.part_col;
    -->> Returns expected result
    select t1.some_string,t2.some_string,sum(t1.total_count),sum(t2.total_count) from table_a t1 join table_b t2 on t1.part_col = t2.part_col where t1.part_col >= 'mypart' and t2.part_col >= 'mypart' group by t1.some_string,t2.some_string;
    --->>Works fine.
    select t1.some_string,t2.some_string,sum(t1.total_count),sum(t2.total_count) from table_a t1 join table_b t2 on t1.part_col = t2.part_col where t1.part_col >= 'mypart' and t2.part_col >= 'mypart' group by t1.some_string,t2.some_st* from table_a t1 join table_b t2 on t1.part_col = t2.part_col where t1.part_col >= 'mypart' and t2.part_col >= 'mypart';
    --->Works fine.

    I created the two files with sample data in them and copied it to hdfs

    I'll try later on your hive 0.5.0 but looks like there might be something wrong in your query.
    On Jan 18, 2011, at 8:40 PM, Ajo Fod wrote:

    Can you try this with a dummy table with very few rows ... to see if
    the reason the script doesn't finish is a computational issue?

    One other thing is to try with a combined partition, to see if it is a
    problem with the partitioning.

    Also, take a look at the results of an EXPLAIN statement, see if
    there are any hints there.

    NOTE: I'm new to hive too.

    -Ajo

    On Tue, Jan 18, 2011 at 8:08 PM, Viral Bajaria wrote:
    I haven't heard back from any on the list and am still struggling to join
    two tables on partitioned column

    Has anyone every tried joining two tables on a paritioned column and the
    results are not as expected ?
    On Tue, Jan 18, 2011 at 2:04 AM, Viral Bajaria <viral.bajaria@gmail.com>
    wrote:
    I am facing issues with a query where I am joining two fairly large tables
    on the partitioned column along with other common columns. The expected
    output is not in line with what I expect it to be. Since the query is very
    complex, I will simplify it so that people can provide inputs if they have
    faced similar issues or if I am doing something totally wrong.
    TABLE A:
    a_id bigint
    common_id bigint
    some_string string
    total_count bigint
    part_col string <---- this is the partitioned column
    TABLE B:
    b_int bigint
    common_id bigint
    some_string string
    total_sum bigint
    part_col string <---- this is the partitioned column
    now the query is as follows:
    SELECT /*+ STREAMTABLE(A,B) */ A.some_string, B.some_string,
    sum(A.total_count), sum(B.total_sum) from A JOIN B ON (t1.part_col =
    t2.part_col AND t1.common_id = t2.common_id) WHERE t1.part_col >= 'val1' AND
    t2.part_col >= 'val1' GROUP BY A.some_string, B.some_string
    Does HIVE not like to join on the partitioned columns ? because when i
    create a join on just the partitioned column the reduce step never finishes.
    I am using HIVE 0.5.0
    Thanks,
    Viral
    Appan Thirumaligai
    appan@ngmoco.com
    Ph:1-818-472-8427
    ngmoco:)
    Appan Thirumaligai
    appan@ngmoco.com
    Ph:1-818-472-8427
    ngmoco:)
  • Viral Bajaria at Jan 20, 2011 at 12:45 am
    Thanks again.

    I think I figured out the bug (not sure if it's a bug or whether that's
    a known limitation when creating a third-level join) .... we need another
    table c to re-create my scenario.

    table_a
    create table table_a(a_id bigint, common_id bigint, int_a int, int_b int,
    int_c int, int_d int, string_a string, total_count bigint) partitioned by
    (part_col string) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' LINES
    TERMINATED BY '\n' STORED AS TEXTFILE;

    table_b
    create table table_b(b_id bigint, common_id bigint, int_a int, int_b int,
    int_c int, int_d int, string_b string, total_count bigint) partitioned by
    (part_col string) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' LINES
    TERMINATED BY '\n' STORED AS TEXTFILE;
    table_c
    create table table_c(int_c int, string_c string) ROW FORMAT DELIMITED FIELDS
    TERMINATED BY ',' LINES TERMINATED BY '\n' STORED AS TEXTFILE;

    query
    explain select c.string_c, sum(a.total_count), sum(b.total_count) from
    table_a a join table_b b on a.common_id = b.common_id and a.int_c = b.int_c
    and a.int_d = b.int_d join table_c c on a.int_c = c.int_c where a.part_col
    = "blah1" and b.part_col >= "blah1" group by c.string_c;
    If you look at the query plan for table_b you will see the value expression
    does not project the column int_c, and if you look at the join operator in
    the query plan it show a join operating between 3 columns from table_a and 2
    columns from table_b which is not the intention of the query.

    I think hive should output all the columns from table_b which are part of
    the join conditions and not look to see if the column is going to be
    consumed in the later stages.

    Do you think I am not writing the hive query in a right way ? the query
    would return results as expected in a mysql or sql-server environment.

    Thanks,
    Viral
    On Wed, Jan 19, 2011 at 11:06 AM, Appan Thirumaligai wrote:

    EXPLAIN select
    t1.some_string,t2.some_string,sum(t1.total_count),sum(t2.total_count) from
    table_a t1 join table_b t2 on t1.part_col = t2.part_col and t1.common_id =
    t2.common_id where t1.part_col >= 'mypart' and t2.part_col >= 'mypart' group
    by t1.some_string,t2.some_string;

    OK
    ABSTRACT SYNTAX TREE:
    (TOK_QUERY (TOK_FROM (TOK_JOIN (TOK_TABREF table_a t1) (TOK_TABREF
    table_b t2) (and (= (. (TOK_TABLE_OR_COL t1) part_col) (. (TOK_TABLE_OR_COL
    t2) part_col)) (= (. (TOK_TABLE_OR_COL t1) common_id) (. (TOK_TABLE_OR_COL
    t2) common_id))))) (TOK_INSERT (TOK_DESTINATION (TOK_DIR TOK_TMP_FILE))
    (TOK_SELECT (TOK_SELEXPR (. (TOK_TABLE_OR_COL t1) some_string)) (TOK_SELEXPR
    (. (TOK_TABLE_OR_COL t2) some_string)) (TOK_SELEXPR (TOK_FUNCTION sum (.
    (TOK_TABLE_OR_COL t1) total_count))) (TOK_SELEXPR (TOK_FUNCTION sum (.
    (TOK_TABLE_OR_COL t2) total_count)))) (TOK_WHERE (and (>= (.
    (TOK_TABLE_OR_COL t1) part_col) 'mypart') (>= (. (TOK_TABLE_OR_COL t2)
    part_col) 'mypart'))) (TOK_GROUPBY (. (TOK_TABLE_OR_COL t1) some_string) (.
    (TOK_TABLE_OR_COL t2) some_string))))

    STAGE DEPENDENCIES:
    Stage-1 is a root stage
    Stage-2 depends on stages: Stage-1
    Stage-0 is a root stage

    STAGE PLANS:
    Stage: Stage-1
    Map Reduce
    Alias -> Map Operator Tree:
    t1
    TableScan
    alias: t1
    Filter Operator
    predicate:
    expr: (part_col >= 'mypart')
    type: boolean
    Reduce Output Operator
    key expressions:
    expr: part_col
    type: string
    expr: common_id
    type: bigint
    sort order: ++
    Map-reduce partition columns:
    expr: part_col
    type: string
    expr: common_id
    type: bigint
    tag: 0
    value expressions:
    expr: some_string
    type: string
    expr: total_count
    type: bigint
    expr: part_col
    type: string
    t2
    TableScan
    alias: t2
    Filter Operator
    predicate:
    expr: (part_col >= 'mypart')
    type: boolean
    Reduce Output Operator
    key expressions:
    expr: part_col
    type: string
    expr: common_id
    type: bigint
    sort order: ++
    Map-reduce partition columns:
    expr: part_col
    type: string
    expr: common_id
    type: bigint
    tag: 1
    value expressions:
    expr: some_string
    type: string
    expr: total_count
    type: bigint
    expr: part_col
    type: string
    Reduce Operator Tree:
    Join Operator
    condition map:
    Inner Join 0 to 1
    condition expressions:
    0 {VALUE._col2} {VALUE._col3} {VALUE._col4}
    1 {VALUE._col2} {VALUE._col3} {VALUE._col4}
    handleSkewJoin: false
    outputColumnNames: _col2, _col3, _col4, _col9, _col10, _col11
    Filter Operator
    predicate:
    expr: ((_col4 >= 'mypart') and (_col11 >= 'mypart'))
    type: boolean
    Select Operator
    expressions:
    expr: _col2
    type: string
    expr: _col9
    type: string
    expr: _col3
    type: bigint
    expr: _col10
    type: bigint
    outputColumnNames: _col2, _col9, _col3, _col10
    Group By Operator
    aggregations:
    expr: sum(_col3)
    expr: sum(_col10)
    bucketGroup: false
    keys:
    expr: _col2
    type: string
    expr: _col9
    type: string
    mode: hash
    outputColumnNames: _col0, _col1, _col2, _col3
    File Output Operator
    compressed: false
    GlobalTableId: 0
    table:
    input format:
    org.apache.hadoop.mapred.SequenceFileInputFormat
    output format:
    org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat

    Stage: Stage-2
    Map Reduce
    Alias -> Map Operator Tree:

    hdfs://localhost:8022/tmp/hive-training/hive_2011-01-19_11-05-34_526_453408324928472657/-mr-10002

    Reduce Output Operator
    key expressions:
    expr: _col0
    type: string
    expr: _col1
    type: string
    sort order: ++
    Map-reduce partition columns:
    expr: _col0
    type: string
    expr: _col1
    type: string
    tag: -1
    value expressions:
    expr: _col2
    type: bigint
    expr: _col3
    type: bigint
    Reduce Operator Tree:
    Group By Operator
    aggregations:
    expr: sum(VALUE._col0)
    expr: sum(VALUE._col1)
    bucketGroup: false
    keys:
    expr: KEY._col0
    type: string
    expr: KEY._col1
    type: string
    mode: mergepartial
    outputColumnNames: _col0, _col1, _col2, _col3
    Select Operator
    expressions:
    expr: _col0
    type: string
    expr: _col1
    type: string
    expr: _col2
    type: bigint
    expr: _col3
    type: bigint
    outputColumnNames: _col0, _col1, _col2, _col3
    File Output Operator
    compressed: false
    GlobalTableId: 0
    table:
    input format: org.apache.hadoop.mapred.TextInputFormat
    output format:
    org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat

    Stage: Stage-0
    Fetch Operator
    limit: -1

    On Jan 19, 2011, at 10:47 AM, Viral Bajaria wrote:

    Thanks Appan for verifying. I will do some more tests on my side too and
    let you know the results.

    I tried a different version of the query where I join'ed two sub-queries
    for the same partitions and the data comes out to be correct.

    I will see if I can post the real-world example to the list, because that
    might sound like a more practical example.

    If you still have your example(s) do you mind sending me your query-plan
    for

    select
    t1.some_string,t2.some_string,sum(t1.total_count),sum(t2.total_count) from
    table_a t1 join table_b t2 on t1.part_col = t2.part_col and t1.common_id =
    t2.common_id where t1.part_col >= 'mypart' and t2.part_col >= 'mypart' group
    by t1.some_string,t2.some_string;

    -Viral
    On Wed, Jan 19, 2011 at 10:36 AM, Appan Thirumaligai <
    athirumaligai@ngmoco.com> wrote:
    Viral,

    I tried the queries below (similar to yours) and I get the expected
    results when I do the join. I ran my queries after building hive from the
    latest source and hadoop 0.20+.
    create table table_a(a_id bigint, common_id bigint, some_string
    string,total_count bigint) partitioned by (part_col string) ROW FORMAT
    DELIMITED FIELDS TERMINATED BY ',' LINES TERMINATED BY '\n' STORED AS
    TEXTFILE;
    create table table_b(b_id bigint, common_id bigint, some_string
    string,total_count bigint) partitioned by (part_col string) ROW FORMAT
    DELIMITED FIELDS TERMINATED BY ',' LINES TERMINATED BY '\n' STORED AS
    TEXTFILE;
    dfs -mkdir /user/data/table_a;
    dfs -mkdir /user/data/table_b;
    dfs -put /home/training/hiveug/table_a.csv /user/data/table_a;
    dfs -put /home/training/hiveug/table_b.csv /user/data/table_b;
    alter table table_a add partition (part_col = 'mypart') location
    '/user/data/table_a';
    alter table table_b add partition (part_col = 'mypart') location
    '/user/data/table_b';
    select * from table_a t1 join table_b t2 on t1.part_col == t2.part_col;
    -->> Returns expected result
    select
    t1.some_string,t2.some_string,sum(t1.total_count),sum(t2.total_count) from
    table_a t1 join table_b t2 on t1.part_col = t2.part_col where t1.part_col >=
    'mypart' and t2.part_col >= 'mypart' group by t1.some_string,t2.some_string;
    --->>Works fine.
    select
    t1.some_string,t2.some_string,sum(t1.total_count),sum(t2.total_count) from
    table_a t1 join table_b t2 on t1.part_col = t2.part_col where t1.part_col >=
    'mypart' and t2.part_col >= 'mypart' group by t1.some_string,t2.some_st*
    from table_a t1 join table_b t2 on t1.part_col = t2.part_col where
    t1.part_col >= 'mypart' and t2.part_col >= 'mypart';
    --->Works fine.

    I created the two files with sample data in them and copied it to hdfs

    I'll try later on your hive 0.5.0 but looks like there might be something
    wrong in your query.

    On Jan 18, 2011, at 8:40 PM, Ajo Fod wrote:

    Can you try this with a dummy table with very few rows ... to see if
    the reason the script doesn't finish is a computational issue?

    One other thing is to try with a combined partition, to see if it is a
    problem with the partitioning.

    Also, take a look at the results of an EXPLAIN statement, see if
    there are any hints there.

    NOTE: I'm new to hive too.

    -Ajo


    On Tue, Jan 18, 2011 at 8:08 PM, Viral Bajaria <viral.bajaria@gmail.com>
    wrote:

    I haven't heard back from any on the list and am still struggling to join

    two tables on partitioned column


    Has anyone every tried joining two tables on a paritioned column and the

    results are not as expected ?

    On Tue, Jan 18, 2011 at 2:04 AM, Viral Bajaria <viral.bajaria@gmail.com>

    wrote:


    I am facing issues with a query where I am joining two fairly large
    tables

    on the partitioned column along with other common columns. The expected

    output is not in line with what I expect it to be. Since the query is
    very

    complex, I will simplify it so that people can provide inputs if they
    have

    faced similar issues or if I am doing something totally wrong.

    TABLE A:

    a_id bigint

    common_id bigint

    some_string string

    total_count bigint

    part_col string <---- this is the partitioned column

    TABLE B:

    b_int bigint

    common_id bigint

    some_string string

    total_sum bigint

    part_col string <---- this is the partitioned column

    now the query is as follows:

    SELECT /*+ STREAMTABLE(A,B) */ A.some_string, B.some_string,

    sum(A.total_count), sum(B.total_sum) from A JOIN B ON (t1.part_col =

    t2.part_col AND t1.common_id = t2.common_id) WHERE t1.part_col >= 'val1'
    AND

    t2.part_col >= 'val1' GROUP BY A.some_string, B.some_string

    Does HIVE not like to join on the partitioned columns ? because when i

    create a join on just the partitioned column the reduce step never
    finishes.

    I am using HIVE 0.5.0

    Thanks,

    Viral



    Appan Thirumaligai
    appan@ngmoco.com
    Ph:1-818-472-8427
    ngmoco:)
    Appan Thirumaligai
    appan@ngmoco.com
    Ph:1-818-472-8427
    ngmoco:)

Related Discussions

Discussion Navigation
viewthread | post
Discussion Overview
groupuser @
categorieshive, hadoop
postedJan 18, '11 at 10:05a
activeJan 20, '11 at 12:45a
posts7
users3
websitehive.apache.org

People

Translate

site design / logo © 2021 Grokbase