FAQ
Hi all,

I have a perplexing problem with joins in Hive. I have two tables "A" and
"B". A contains two columns x and y. B also contains x and y, and also z. I
want to select the values of z from B, for every (x, y) I see in table A.

So I wrote this statement:

SELECT b.z FROM a JOIN b ON (a.x = b.x AND a.y = b.y);

This crunches away and then comes back with an empty resultset. I have
exported the same data into MySQL and ran the same query; it works just fine
there. Does Hive support multi-column join criteria?

Thanks,
- Aaron Kimball

Search Discussions

  • Zheng Shao at Jan 23, 2009 at 5:54 am
    Can you do "describe extended a;" and "describe extended b;" and pste
    the result here?

    Zheng

    On 1/22/09, Aaron Kimball wrote:
    Hi all,

    I have a perplexing problem with joins in Hive. I have two tables "A" and
    "B". A contains two columns x and y. B also contains x and y, and also z. I
    want to select the values of z from B, for every (x, y) I see in table A.

    So I wrote this statement:

    SELECT b.z FROM a JOIN b ON (a.x = b.x AND a.y = b.y);

    This crunches away and then comes back with an empty resultset. I have
    exported the same data into MySQL and ran the same query; it works just fine
    there. Does Hive support multi-column join criteria?

    Thanks,
    - Aaron Kimball
    --
    Sent from Gmail for mobile | mobile.google.com

    Yours,
    Zheng
  • Aaron Kimball at Jan 23, 2009 at 4:57 am
    Hi Zheng,

    I was eliding some information to try to boil things down a bit -- but here
    goes.

    To expand the white lie into the fuller debug dump:
    table A ~ ak_filtered_ips
    table B ~ ip_locations

    It's actually three fields ('a', 'b', and 'c') that I'm joining on, not two.

    hive> describe extended ak_filtered_ips;
    OK
    a int
    b int
    c int
    Detailed Table Information:
    Table(tableName:ak_filtered_ips,dbName:default,owner:aaron,createTime:1232661581,lastAccessTime:0,retention:0,sd:StorageDescriptor(cols:[FieldSchema(name:a,type:int,comment:null),
    FieldSchema(name:b,type:int,comment:null),
    FieldSchema(name:c,type:int,comment:null)],location:hdfs://(server name
    redacted):9000/user/hive/warehouse/ak_filtered_ips,inputFormat:org.apache.hadoop.mapred.TextInputFormat,outputFormat:org.apache.hadoop.hive.ql.io.IgnoreKeyTextOutputFormat,compressed:false,numBuckets:-1,serdeInfo:SerDeInfo(name:null,serializationLib:org.apache.hadoop.hive.serde2.MetadataTypedColumnsetSerDe,parameters:{serialization.format=1}),bucketCols:[],sortCols:[],parameters:{}),partitionKeys:[],parameters:{})

    hive> describe extended ip_locations;
    OK
    a int
    b int
    c int
    cityid int
    countryid int
    Detailed Table Information:
    Table(tableName:ip_locations,dbName:default,owner:aaron,createTime:1232596116,lastAccessTime:0,retention:0,sd:StorageDescriptor(cols:[FieldSchema(name:a,type:int,comment:null),
    FieldSchema(name:b,type:int,comment:null),
    FieldSchema(name:c,type:int,comment:null),
    FieldSchema(name:cityid,type:int,comment:null),
    FieldSchema(name:countryid,type:int,comment:null)],location:hdfs://(server
    name
    redacted):9000/user/hive/warehouse/ip_locations,inputFormat:org.apache.hadoop.mapred.TextInputFormat,outputFormat:org.apache.hadoop.hive.ql.io.IgnoreKeyTextOutputFormat,compressed:false,numBuckets:-1,serdeInfo:SerDeInfo(name:null,serializationLib:org.apache.hadoop.hive.serde2.MetadataTypedColumnsetSerDe,parameters:{serialization.format=1}),bucketCols:[],sortCols:[],parameters:{}),partitionKeys:[],parameters:{})

    I actually just loaded in a super-small test case involving A, B, x, y, and
    z as I initially described. And that statement *does* work.

    The actual statement I was running into problems with is:

    SELECT ip_locations.cityid, ip_locations.countryid FROM
    ak_filtered_ips JOIN ip_locations ON (ak_filtered_ips.a = ip_locations.a AND
    ak_filtered_ips.b = ip_locations.b AND ak_filtered_ips.c = ip_locations.c)

    I would eventually like to add a "count(1)" to the field list, and a "GROUP
    BY ip_locations.cityid, ip_locations.countryid" at the end.

    This SELECT statement returns no rows.

    Thanks,
    - Aaron

    On Thu, Jan 22, 2009 at 4:27 PM, Zheng Shao wrote:

    Can you do "describe extended a;" and "describe extended b;" and pste
    the result here?

    Zheng

    On 1/22/09, Aaron Kimball wrote:
    Hi all,

    I have a perplexing problem with joins in Hive. I have two tables "A" and
    "B". A contains two columns x and y. B also contains x and y, and also z. I
    want to select the values of z from B, for every (x, y) I see in table A.

    So I wrote this statement:

    SELECT b.z FROM a JOIN b ON (a.x = b.x AND a.y = b.y);

    This crunches away and then comes back with an empty resultset. I have
    exported the same data into MySQL and ran the same query; it works just fine
    there. Does Hive support multi-column join criteria?

    Thanks,
    - Aaron Kimball
    --
    Sent from Gmail for mobile | mobile.google.com

    Yours,
    Zheng
  • Zheng Shao at Jan 23, 2009 at 7:43 am
    This is unexpected. We did try similar cases before (with multiple join
    keys).

    Can you verify by finding out the actual rows in each table that will
    produce a join result row?

    Zheng
    On Thu, Jan 22, 2009 at 6:09 PM, Aaron Kimball wrote:

    Hi Zheng,

    I was eliding some information to try to boil things down a bit -- but here
    goes.


    To expand the white lie into the fuller debug dump:
    table A ~ ak_filtered_ips
    table B ~ ip_locations

    It's actually three fields ('a', 'b', and 'c') that I'm joining on, not
    two.

    hive> describe extended ak_filtered_ips;
    OK
    a int
    b int
    c int
    Detailed Table Information:
    Table(tableName:ak_filtered_ips,dbName:default,owner:aaron,createTime:1232661581,lastAccessTime:0,retention:0,sd:StorageDescriptor(cols:[FieldSchema(name:a,type:int,comment:null),
    FieldSchema(name:b,type:int,comment:null),
    FieldSchema(name:c,type:int,comment:null)],location:hdfs://(server name
    redacted):9000/user/hive/warehouse/ak_filtered_ips,inputFormat:org.apache.hadoop.mapred.TextInputFormat,outputFormat:org.apache.hadoop.hive.ql.io.IgnoreKeyTextOutputFormat,compressed:false,numBuckets:-1,serdeInfo:SerDeInfo(name:null,serializationLib:org.apache.hadoop.hive.serde2.MetadataTypedColumnsetSerDe,parameters:{serialization.format=1}),bucketCols:[],sortCols:[],parameters:{}),partitionKeys:[],parameters:{})

    hive> describe extended ip_locations;
    OK
    a int
    b int
    c int
    cityid int
    countryid int
    Detailed Table Information:
    Table(tableName:ip_locations,dbName:default,owner:aaron,createTime:1232596116,lastAccessTime:0,retention:0,sd:StorageDescriptor(cols:[FieldSchema(name:a,type:int,comment:null),
    FieldSchema(name:b,type:int,comment:null),
    FieldSchema(name:c,type:int,comment:null),
    FieldSchema(name:cityid,type:int,comment:null),
    FieldSchema(name:countryid,type:int,comment:null)],location:hdfs://(server
    name
    redacted):9000/user/hive/warehouse/ip_locations,inputFormat:org.apache.hadoop.mapred.TextInputFormat,outputFormat:org.apache.hadoop.hive.ql.io.IgnoreKeyTextOutputFormat,compressed:false,numBuckets:-1,serdeInfo:SerDeInfo(name:null,serializationLib:org.apache.hadoop.hive.serde2.MetadataTypedColumnsetSerDe,parameters:{serialization.format=1}),bucketCols:[],sortCols:[],parameters:{}),partitionKeys:[],parameters:{})

    I actually just loaded in a super-small test case involving A, B, x, y, and
    z as I initially described. And that statement *does* work.

    The actual statement I was running into problems with is:

    SELECT ip_locations.cityid, ip_locations.countryid FROM
    ak_filtered_ips JOIN ip_locations ON (ak_filtered_ips.a = ip_locations.a
    AND ak_filtered_ips.b = ip_locations.b AND ak_filtered_ips.c =
    ip_locations.c)

    I would eventually like to add a "count(1)" to the field list, and a "GROUP
    BY ip_locations.cityid, ip_locations.countryid" at the end.

    This SELECT statement returns no rows.

    Thanks,
    - Aaron

    On Thu, Jan 22, 2009 at 4:27 PM, Zheng Shao wrote:

    Can you do "describe extended a;" and "describe extended b;" and pste
    the result here?

    Zheng

    On 1/22/09, Aaron Kimball wrote:
    Hi all,

    I have a perplexing problem with joins in Hive. I have two tables "A" and
    "B". A contains two columns x and y. B also contains x and y, and also z. I
    want to select the values of z from B, for every (x, y) I see in table A.
    So I wrote this statement:

    SELECT b.z FROM a JOIN b ON (a.x = b.x AND a.y = b.y);

    This crunches away and then comes back with an empty resultset. I have
    exported the same data into MySQL and ran the same query; it works just fine
    there. Does Hive support multi-column join criteria?

    Thanks,
    - Aaron Kimball
    --
    Sent from Gmail for mobile | mobile.google.com

    Yours,
    Zheng

    --
    Yours,
    Zheng
  • Aaron Kimball at Jan 23, 2009 at 9:24 am
    Not only can I verify that the rows are present, I broke down and coded the
    join in Java MapReduce by hand. Works.
    - Aaron
    On Thu, Jan 22, 2009 at 11:42 PM, Zheng Shao wrote:

    This is unexpected. We did try similar cases before (with multiple join
    keys).

    Can you verify by finding out the actual rows in each table that will
    produce a join result row?

    Zheng

    On Thu, Jan 22, 2009 at 6:09 PM, Aaron Kimball wrote:

    Hi Zheng,

    I was eliding some information to try to boil things down a bit -- but
    here goes.


    To expand the white lie into the fuller debug dump:
    table A ~ ak_filtered_ips
    table B ~ ip_locations

    It's actually three fields ('a', 'b', and 'c') that I'm joining on, not
    two.

    hive> describe extended ak_filtered_ips;
    OK
    a int
    b int
    c int
    Detailed Table Information:
    Table(tableName:ak_filtered_ips,dbName:default,owner:aaron,createTime:1232661581,lastAccessTime:0,retention:0,sd:StorageDescriptor(cols:[FieldSchema(name:a,type:int,comment:null),
    FieldSchema(name:b,type:int,comment:null),
    FieldSchema(name:c,type:int,comment:null)],location:hdfs://(server name
    redacted):9000/user/hive/warehouse/ak_filtered_ips,inputFormat:org.apache.hadoop.mapred.TextInputFormat,outputFormat:org.apache.hadoop.hive.ql.io.IgnoreKeyTextOutputFormat,compressed:false,numBuckets:-1,serdeInfo:SerDeInfo(name:null,serializationLib:org.apache.hadoop.hive.serde2.MetadataTypedColumnsetSerDe,parameters:{serialization.format=1}),bucketCols:[],sortCols:[],parameters:{}),partitionKeys:[],parameters:{})

    hive> describe extended ip_locations;
    OK
    a int
    b int
    c int
    cityid int
    countryid int
    Detailed Table Information:
    Table(tableName:ip_locations,dbName:default,owner:aaron,createTime:1232596116,lastAccessTime:0,retention:0,sd:StorageDescriptor(cols:[FieldSchema(name:a,type:int,comment:null),
    FieldSchema(name:b,type:int,comment:null),
    FieldSchema(name:c,type:int,comment:null),
    FieldSchema(name:cityid,type:int,comment:null),
    FieldSchema(name:countryid,type:int,comment:null)],location:hdfs://(server
    name
    redacted):9000/user/hive/warehouse/ip_locations,inputFormat:org.apache.hadoop.mapred.TextInputFormat,outputFormat:org.apache.hadoop.hive.ql.io.IgnoreKeyTextOutputFormat,compressed:false,numBuckets:-1,serdeInfo:SerDeInfo(name:null,serializationLib:org.apache.hadoop.hive.serde2.MetadataTypedColumnsetSerDe,parameters:{serialization.format=1}),bucketCols:[],sortCols:[],parameters:{}),partitionKeys:[],parameters:{})

    I actually just loaded in a super-small test case involving A, B, x, y,
    and z as I initially described. And that statement *does* work.

    The actual statement I was running into problems with is:

    SELECT ip_locations.cityid, ip_locations.countryid FROM
    ak_filtered_ips JOIN ip_locations ON (ak_filtered_ips.a = ip_locations.a
    AND ak_filtered_ips.b = ip_locations.b AND ak_filtered_ips.c =
    ip_locations.c)

    I would eventually like to add a "count(1)" to the field list, and a
    "GROUP BY ip_locations.cityid, ip_locations.countryid" at the end.

    This SELECT statement returns no rows.

    Thanks,
    - Aaron

    On Thu, Jan 22, 2009 at 4:27 PM, Zheng Shao wrote:

    Can you do "describe extended a;" and "describe extended b;" and pste
    the result here?

    Zheng

    On 1/22/09, Aaron Kimball wrote:
    Hi all,

    I have a perplexing problem with joins in Hive. I have two tables "A" and
    "B". A contains two columns x and y. B also contains x and y, and also z. I
    want to select the values of z from B, for every (x, y) I see in table A.
    So I wrote this statement:

    SELECT b.z FROM a JOIN b ON (a.x = b.x AND a.y = b.y);

    This crunches away and then comes back with an empty resultset. I have
    exported the same data into MySQL and ran the same query; it works just fine
    there. Does Hive support multi-column join criteria?

    Thanks,
    - Aaron Kimball
    --
    Sent from Gmail for mobile | mobile.google.com

    Yours,
    Zheng

    --
    Yours,
    Zheng
  • Ashish Thusoo at Jan 23, 2009 at 11:49 am
    Aaron

    can you send the plan for this query..

    explain extended <your query>

    This is very unexpected..

    Thanks,
    Ashish
    ________________________________________
    From: Aaron Kimball [aaron@cloudera.com]
    Sent: Friday, January 23, 2009 1:24 AM
    To: hive-user@hadoop.apache.org
    Subject: Re: equijoin with multiple columns?

    Not only can I verify that the rows are present, I broke down and coded the join in Java MapReduce by hand. Works.
    - Aaron

    On Thu, Jan 22, 2009 at 11:42 PM, Zheng Shao wrote:
    This is unexpected. We did try similar cases before (with multiple join keys).

    Can you verify by finding out the actual rows in each table that will produce a join result row?

    Zheng


    On Thu, Jan 22, 2009 at 6:09 PM, Aaron Kimball wrote:
    Hi Zheng,

    I was eliding some information to try to boil things down a bit -- but here goes.


    To expand the white lie into the fuller debug dump:
    table A ~ ak_filtered_ips
    table B ~ ip_locations

    It's actually three fields ('a', 'b', and 'c') that I'm joining on, not two.

    hive> describe extended ak_filtered_ips;
    OK
    a int
    b int
    c int
    Detailed Table Information:
    Table(tableName:ak_filtered_
    ips,dbName:default,owner:aaron,createTime:1232661581,lastAccessTime:0,retention:0,sd:StorageDescriptor(cols:[FieldSchema(name:a,type:int,comment:null), FieldSchema(name:b,type:int,comment:null), FieldSchema(name:c,type:int,comment:null)],location:hdfs://(server name redacted):9000/user/hive/warehouse/ak_filtered_ips,inputFormat:org.apache.hadoop.mapred.TextInputFormat,outputFormat:org.apache.hadoop.hive.ql.io.IgnoreKeyTextOutputFormat,compressed:false,numBuckets:-1,serdeInfo:SerDeInfo(name:null,serializationLib:org.apache.hadoop.hive.serde2.MetadataTypedColumnsetSerDe,parameters:{serialization.format=1}),bucketCols:[],sortCols:[],parameters:{}),partitionKeys:[],parameters:{})

    hive> describe extended ip_locations;
    OK
    a int
    b int
    c int
    cityid int
    countryid int
    Detailed Table Information:
    Table(tableName:ip_locations,dbName:default,owner:aaron,createTime:1232596116,lastAccessTime:0,retention:0,sd:StorageDescriptor(cols:[FieldSchema(name:a,type:int,comment:null), FieldSchema(name:b,type:int,comment:null), FieldSchema(name:c,type:int,comment:null), FieldSchema(name:cityid,type:int,comment:null), FieldSchema(name:countryid,type:int,comment:null)],location:hdfs://(server name redacted):9000/user/hive/warehouse/ip_locations,inputFormat:org.apache.hadoop.mapred.TextInputFormat,outputFormat:org.apache.hadoop.hive.ql.io.IgnoreKeyTextOutputFormat,compressed:false,numBuckets:-1,serdeInfo:SerDeInfo(name:null,serializationLib:org.apache.hadoop.hive.serde2.MetadataTypedColumnsetSerDe,parameters:{serialization.format=1}),bucketCols:[],sortCols:[],parameters:{}),partitionKeys:[],parameters:{})

    I actually just loaded in a super-small test case involving A, B, x, y, and z as I initially described. And that statement *does* work.

    The actual statement I was running into problems with is:

    SELECT ip_locations.cityid, ip_locations.countryid FROM
    ak_filtered_ips JOIN ip_locations ON (ak_filtered_ips.a = ip_locations.a AND ak_filtered_ips.b = ip_locations.b AND ak_filtered_ips.c = ip_locations.c)

    I would eventually like to add a "count(1)" to the field list, and a "GROUP BY ip_locations.cityid, ip_locations.countryid" at the end.

    This SELECT statement returns no rows.

    Thanks,
    - Aaron


    On Thu, Jan 22, 2009 at 4:27 PM, Zheng Shao wrote:
    Can you do "describe extended a;" and "describe extended b;" and pste
    the result here?

    Zheng

    On 1/22/09, Aaron Kimball wrote:
    Hi all,

    I have a perplexing problem with joins in Hive. I have two tables "A" and
    "B". A contains two columns x and y. B also contains x and y, and also z. I
    want to select the values of z from B, for every (x, y) I see in table A.

    So I wrote this statement:

    SELECT b.z FROM a JOIN b ON (a.x = b.x AND a.y = b.y);

    This crunches away and then comes back with an empty resultset. I have
    exported the same data into MySQL and ran the same query; it works just fine
    there. Does Hive support multi-column join criteria?

    Thanks,
    - Aaron Kimball
    --
    Sent from Gmail for mobile | mobile.google.com<http://mobile.google.com>

    Yours,
    Zheng




    --
    Yours,
    Zheng
  • Aaron Kimball at Jan 23, 2009 at 11:56 pm
    I solved the bug. As it turns out, there was trailing whitespace in some of
    my input files. So Hive wasn't matching the column's contents correctly,
    since Hive doesn't do an implicit trim() (my handwritten code did this). And
    of course, when I looked at the input files with less to verify their
    contents, the whitespace was invisible :P

    - Aaron
    On Fri, Jan 23, 2009 at 3:48 AM, Ashish Thusoo wrote:

    Aaron

    can you send the plan for this query..

    explain extended <your query>

    This is very unexpected..

    Thanks,
    Ashish
    ________________________________________
    From: Aaron Kimball [aaron@cloudera.com]
    Sent: Friday, January 23, 2009 1:24 AM
    To: hive-user@hadoop.apache.org
    Subject: Re: equijoin with multiple columns?

    Not only can I verify that the rows are present, I broke down and coded the
    join in Java MapReduce by hand. Works.
    - Aaron

    On Thu, Jan 22, 2009 at 11:42 PM, Zheng Shao <zshao9@gmail.com<mailto:
    zshao9@gmail.com>> wrote:
    This is unexpected. We did try similar cases before (with multiple join
    keys).

    Can you verify by finding out the actual rows in each table that will
    produce a join result row?

    Zheng


    On Thu, Jan 22, 2009 at 6:09 PM, Aaron Kimball <aaron@cloudera.com<mailto:
    aaron@cloudera.com>> wrote:
    Hi Zheng,

    I was eliding some information to try to boil things down a bit -- but here
    goes.


    To expand the white lie into the fuller debug dump:
    table A ~ ak_filtered_ips
    table B ~ ip_locations

    It's actually three fields ('a', 'b', and 'c') that I'm joining on, not
    two.

    hive> describe extended ak_filtered_ips;
    OK
    a int
    b int
    c int
    Detailed Table Information:
    Table(tableName:ak_filtered_
    ips,dbName:default,owner:aaron,createTime:1232661581,lastAccessTime:0,retention:0,sd:StorageDescriptor(cols:[FieldSchema(name:a,type:int,comment:null),
    FieldSchema(name:b,type:int,comment:null),
    FieldSchema(name:c,type:int,comment:null)],location:hdfs://(server name
    redacted):9000/user/hive/warehouse/ak_filtered_ips,inputFormat:org.apache.hadoop.mapred.TextInputFormat,outputFormat:org.apache.hadoop.hive.ql.io.IgnoreKeyTextOutputFormat,compressed:false,numBuckets:-1,serdeInfo:SerDeInfo(name:null,serializationLib:org.apache.hadoop.hive.serde2.MetadataTypedColumnsetSerDe,parameters:{serialization.format=1}),bucketCols:[],sortCols:[],parameters:{}),partitionKeys:[],parameters:{})

    hive> describe extended ip_locations;
    OK
    a int
    b int
    c int
    cityid int
    countryid int
    Detailed Table Information:
    Table(tableName:ip_locations,dbName:default,owner:aaron,createTime:1232596116,lastAccessTime:0,retention:0,sd:StorageDescriptor(cols:[FieldSchema(name:a,type:int,comment:null),
    FieldSchema(name:b,type:int,comment:null),
    FieldSchema(name:c,type:int,comment:null),
    FieldSchema(name:cityid,type:int,comment:null),
    FieldSchema(name:countryid,type:int,comment:null)],location:hdfs://(server
    name
    redacted):9000/user/hive/warehouse/ip_locations,inputFormat:org.apache.hadoop.mapred.TextInputFormat,outputFormat:org.apache.hadoop.hive.ql.io.IgnoreKeyTextOutputFormat,compressed:false,numBuckets:-1,serdeInfo:SerDeInfo(name:null,serializationLib:org.apache.hadoop.hive.serde2.MetadataTypedColumnsetSerDe,parameters:{serialization.format=1}),bucketCols:[],sortCols:[],parameters:{}),partitionKeys:[],parameters:{})

    I actually just loaded in a super-small test case involving A, B, x, y, and
    z as I initially described. And that statement *does* work.

    The actual statement I was running into problems with is:

    SELECT ip_locations.cityid, ip_locations.countryid FROM
    ak_filtered_ips JOIN ip_locations ON (ak_filtered_ips.a = ip_locations.a
    AND ak_filtered_ips.b = ip_locations.b AND ak_filtered_ips.c =
    ip_locations.c)

    I would eventually like to add a "count(1)" to the field list, and a "GROUP
    BY ip_locations.cityid, ip_locations.countryid" at the end.

    This SELECT statement returns no rows.

    Thanks,
    - Aaron


    On Thu, Jan 22, 2009 at 4:27 PM, Zheng Shao <zshao9@gmail.com<mailto:
    zshao9@gmail.com>> wrote:
    Can you do "describe extended a;" and "describe extended b;" and pste
    the result here?

    Zheng

    On 1/22/09, Aaron Kimball wrote:
    Hi all,

    I have a perplexing problem with joins in Hive. I have two tables "A" and
    "B". A contains two columns x and y. B also contains x and y, and also z. I
    want to select the values of z from B, for every (x, y) I see in table A.

    So I wrote this statement:

    SELECT b.z FROM a JOIN b ON (a.x = b.x AND a.y = b.y);

    This crunches away and then comes back with an empty resultset. I have
    exported the same data into MySQL and ran the same query; it works just fine
    there. Does Hive support multi-column join criteria?

    Thanks,
    - Aaron Kimball
    --
    Sent from Gmail for mobile | mobile.google.com<http://mobile.google.com>

    Yours,
    Zheng




    --
    Yours,
    Zheng
  • Raghu Murthy at Jan 24, 2009 at 12:01 am
    We could add trim to hive load, but, unfortunately it has been patented by
    IBM!

    "Method of automatically removing leading and trailing space characters
    from data being entered into a database system"

    http://patft.uspto.gov/netacgi/nph-Parser?Sect1=PTO1&Sect2=HITOFF&d=PALL&p=1
    &u=%2Fnetahtml%2FPTO%2Fsrchnum.htm&r=1&f=G&l=50&s1=7,475,086.PN.&OS=PN/7,475
    ,086&RS=PN/7,475,086

    "Retrieving data from a database system without leading and trailing
    space characters"

    http://appft1.uspto.gov/netacgi/nph-Parser?Sect1=PTO1&Sect2=HITOFF&d=PG01&p=
    1&u=/netahtml/PTO/srchnum.html&r=1&f=G&l=50&s1="20070282820".PGNR.&OS=DN/200
    70282820&RS=DN/20070282820

    On 1/23/09 3:55 PM, "Aaron Kimball" wrote:

    I solved the bug. As it turns out, there was trailing whitespace in some of my
    input files. So Hive wasn't matching the column's contents correctly, since
    Hive doesn't do an implicit trim() (my handwritten code did this). And of
    course, when I looked at the input files with less to verify their contents,
    the whitespace was invisible :P

    - Aaron
    On Fri, Jan 23, 2009 at 3:48 AM, Ashish Thusoo wrote:
    Aaron

    can you send the plan for this query..

    explain extended <your query>

    This is very unexpected..

    Thanks,
    Ashish
    ________________________________________
    From: Aaron Kimball [aaron@cloudera.com]
    Sent: Friday, January 23, 2009 1:24 AM
    To: hive-user@hadoop.apache.org
    Subject: Re: equijoin with multiple columns?

    Not only can I verify that the rows are present, I broke down and coded the
    join in Java MapReduce by hand. Works.
    - Aaron

    On Thu, Jan 22, 2009 at 11:42 PM, Zheng Shao
    wrote:
    This is unexpected. We did try similar cases before (with multiple join
    keys).

    Can you verify by finding out the actual rows in each table that will produce
    a join result row?

    Zheng


    On Thu, Jan 22, 2009 at 6:09 PM, Aaron Kimball
    wrote:
    Hi Zheng,

    I was eliding some information to try to boil things down a bit -- but here
    goes.


    To expand the white lie into the fuller debug dump:
    table A ~ ak_filtered_ips
    table B ~ ip_locations

    It's actually three fields ('a', 'b', and 'c') that I'm joining on, not two.

    hive> describe extended ak_filtered_ips;
    OK
    a int
    b int
    c int
    Detailed Table Information:
    Table(tableName:ak_filtered_
    ips,dbName:default,owner:aaron,createTime:1232661581,lastAccessTime:0,retenti>>
    o
    n:0,sd:StorageDescriptor(cols:[FieldSchema(name:a,type:int,comment:null),
    FieldSchema(name:b,type:int,comment:null),
    FieldSchema(name:c,type:int,comment:null)],location:hdfs://(server name
    redacted):9000/user/hive/warehouse/ak_filtered_ips,inputFormat:org.apache.had>>
    o
    >>
    op.mapred.TextInputFormat,outputFormat:org.apache.hadoop.hive.ql.io.IgnoreKey>>
    T
    >>
    extOutputFormat,compressed:false,numBuckets:-1,serdeInfo:SerDeInfo(name:null,>>
    s
    >>
    erializationLib:org.apache.hadoop.hive.serde2.MetadataTypedColumnsetSerDe,par>>
    a
    >>
    meters:{serialization.format=1}),bucketCols:[],sortCols:[],parameters:{}),par>>
    t
    itionKeys:[],parameters:{})

    hive> describe extended ip_locations;
    OK
    a int
    b int
    c int
    cityid int
    countryid int
    Detailed Table Information:
    Table(tableName:ip_locations,dbName:default,owner:aaron,createTime:1232596116>>
    ,
    >>
    lastAccessTime:0,retention:0,sd:StorageDescriptor(cols:[FieldSchema(name:a,ty>>
    p
    e:int,comment:null), FieldSchema(name:b,type:int,comment:null),
    FieldSchema(name:c,type:int,comment:null),
    FieldSchema(name:cityid,type:int,comment:null),
    FieldSchema(name:countryid,type:int,comment:null)],location:hdfs://(server
    name
    redacted):9000/user/hive/warehouse/ip_locations,inputFormat:org.apache.hadoop>>
    .
    >>
    mapred.TextInputFormat,outputFormat:org.apache.hadoop.hive.ql.io.IgnoreKeyTex>>
    t
    >>
    OutputFormat,compressed:false,numBuckets:-1,serdeInfo:SerDeInfo(name:null,ser>>
    i
    >>
    alizationLib:org.apache.hadoop.hive.serde2.MetadataTypedColumnsetSerDe,parame>>
    t
    >>
    ers:{serialization.format=1}),bucketCols:[],sortCols:[],parameters:{}),partit>>
    i
    onKeys:[],parameters:{})

    I actually just loaded in a super-small test case involving A, B, x, y, and z
    as I initially described. And that statement *does* work.

    The actual statement I was running into problems with is:

    SELECT ip_locations.cityid, ip_locations.countryid FROM
    ak_filtered_ips JOIN ip_locations ON (ak_filtered_ips.a = ip_locations.a AND
    ak_filtered_ips.b = ip_locations.b AND ak_filtered_ips.c = ip_locations.c)

    I would eventually like to add a "count(1)" to the field list, and a "GROUP
    BY
    ip_locations.cityid, ip_locations.countryid" at the end.

    This SELECT statement returns no rows.

    Thanks,
    - Aaron


    On Thu, Jan 22, 2009 at 4:27 PM, Zheng Shao
    wrote:
    Can you do "describe extended a;" and "describe extended b;" and pste
    the result here?

    Zheng

    On 1/22/09, Aaron Kimball wrote:
    Hi all,

    I have a perplexing problem with joins in Hive. I have two tables "A" and
    "B". A contains two columns x and y. B also contains x and y, and also z.
    I
    want to select the values of z from B, for every (x, y) I see in table A.

    So I wrote this statement:

    SELECT b.z FROM a JOIN b ON (a.x = b.x AND a.y = b.y);

    This crunches away and then comes back with an empty resultset. I have
    exported the same data into MySQL and ran the same query; it works just fine
    there. Does Hive support multi-column join criteria?

    Thanks,
    - Aaron Kimball
    --
    Sent from Gmail for mobile | mobile.google.com <http://mobile.google.com>
    <http://mobile.google.com>

    Yours,
    Zheng




    --
    Yours,
    Zheng
  • Aaron Kimball at Jan 24, 2009 at 12:11 am
    Bah, you gotta be kidding me :P

    Something that might be a reasonable compromise is to print a warning about
    trailing whitespace on fields?

    - Aaron
    On Fri, Jan 23, 2009 at 4:00 PM, Raghu Murthy wrote:

    We could add trim to hive load, but, unfortunately it has been patented by
    IBM!

    "Method of automatically removing leading and trailing space characters
    from data being entered into a database system"


    http://patft.uspto.gov/netacgi/nph-Parser?Sect1=PTO1&Sect2=HITOFF&d=PALL&p=1

    &u=%2Fnetahtml%2FPTO%2Fsrchnum.htm&r=1&f=G&l=50&s1=7,475,086.PN.&OS=PN/7,475
    ,086&RS=PN/7,475,086<http://patft.uspto.gov/netacgi/nph-Parser?Sect1=PTO1&Sect2=HITOFF&d=PALL&p=1&u=%2Fnetahtml%2FPTO%2Fsrchnum.htm&r=1&f=G&l=50&s1=7,475,086.PN.&OS=PN/7,475,086&RS=PN/7,475,086>

    "Retrieving data from a database system without leading and trailing
    space characters"


    http://appft1.uspto.gov/netacgi/nph-Parser?Sect1=PTO1&Sect2=HITOFF&d=PG01&p=
    1&u=/netahtml/PTO/srchnum.html&r=1&f=G&l=50&s1=<http://appft1.uspto.gov/netacgi/nph-Parser?Sect1=PTO1&Sect2=HITOFF&d=PG01&p=1&u=/netahtml/PTO/srchnum.html&r=1&f=G&l=50&s1=>
    "20070282820".PGNR.&OS=DN/200
    70282820&RS=DN/20070282820

    On 1/23/09 3:55 PM, "Aaron Kimball" wrote:

    I solved the bug. As it turns out, there was trailing whitespace in some of my
    input files. So Hive wasn't matching the column's contents correctly, since
    Hive doesn't do an implicit trim() (my handwritten code did this). And of
    course, when I looked at the input files with less to verify their contents,
    the whitespace was invisible :P

    - Aaron
    On Fri, Jan 23, 2009 at 3:48 AM, Ashish Thusoo wrote:
    Aaron

    can you send the plan for this query..

    explain extended <your query>

    This is very unexpected..

    Thanks,
    Ashish
    ________________________________________
    From: Aaron Kimball [aaron@cloudera.com]
    Sent: Friday, January 23, 2009 1:24 AM
    To: hive-user@hadoop.apache.org
    Subject: Re: equijoin with multiple columns?

    Not only can I verify that the rows are present, I broke down and coded
    the
    join in Java MapReduce by hand. Works.
    - Aaron

    On Thu, Jan 22, 2009 at 11:42 PM, Zheng Shao
    wrote:
    This is unexpected. We did try similar cases before (with multiple join
    keys).

    Can you verify by finding out the actual rows in each table that will
    produce
    a join result row?

    Zheng


    On Thu, Jan 22, 2009 at 6:09 PM, Aaron Kimball
    wrote:
    Hi Zheng,

    I was eliding some information to try to boil things down a bit -- but
    here
    goes.


    To expand the white lie into the fuller debug dump:
    table A ~ ak_filtered_ips
    table B ~ ip_locations

    It's actually three fields ('a', 'b', and 'c') that I'm joining on, not
    two.
    hive> describe extended ak_filtered_ips;
    OK
    a int
    b int
    c int
    Detailed Table Information:
    Table(tableName:ak_filtered_
    ips,dbName:default,owner:aaron,createTime:1232661581,lastAccessTime:0,retenti>>
    o
    n:0,sd:StorageDescriptor(cols:[FieldSchema(name:a,type:int,comment:null),
    FieldSchema(name:b,type:int,comment:null),
    FieldSchema(name:c,type:int,comment:null)],location:hdfs://(server name
    redacted):9000/user/hive/warehouse/ak_filtered_ips,inputFormat:org.apache.had>>
    o
    op.mapred.TextInputFormat,outputFormat:org.apache.hadoop.hive.ql.io.IgnoreKey>>
    T
    extOutputFormat,compressed:false,numBuckets:-1,serdeInfo:SerDeInfo(name:null,>>
    s
    erializationLib:org.apache.hadoop.hive.serde2.MetadataTypedColumnsetSerDe,par>>
    a
    meters:{serialization.format=1}),bucketCols:[],sortCols:[],parameters:{}),par>>
    t
    itionKeys:[],parameters:{})

    hive> describe extended ip_locations;
    OK
    a int
    b int
    c int
    cityid int
    countryid int
    Detailed Table Information:
    Table(tableName:ip_locations,dbName:default,owner:aaron,createTime:1232596116>>
    ,
    lastAccessTime:0,retention:0,sd:StorageDescriptor(cols:[FieldSchema(name:a,ty>>
    p
    e:int,comment:null), FieldSchema(name:b,type:int,comment:null),
    FieldSchema(name:c,type:int,comment:null),
    FieldSchema(name:cityid,type:int,comment:null),
    FieldSchema(name:countryid,type:int,comment:null)],location:hdfs://(server
    name
    redacted):9000/user/hive/warehouse/ip_locations,inputFormat:org.apache.hadoop>>
    .
    mapred.TextInputFormat,outputFormat:org.apache.hadoop.hive.ql.io.IgnoreKeyTex>>
    t
    OutputFormat,compressed:false,numBuckets:-1,serdeInfo:SerDeInfo(name:null,ser>>
    i
    alizationLib:org.apache.hadoop.hive.serde2.MetadataTypedColumnsetSerDe,parame>>
    t
    ers:{serialization.format=1}),bucketCols:[],sortCols:[],parameters:{}),partit>>
    i
    onKeys:[],parameters:{})

    I actually just loaded in a super-small test case involving A, B, x, y,
    and z
    as I initially described. And that statement *does* work.

    The actual statement I was running into problems with is:

    SELECT ip_locations.cityid, ip_locations.countryid FROM
    ak_filtered_ips JOIN ip_locations ON (ak_filtered_ips.a = ip_locations.a
    AND
    ak_filtered_ips.b = ip_locations.b AND ak_filtered_ips.c =
    ip_locations.c)
    I would eventually like to add a "count(1)" to the field list, and a
    "GROUP
    BY
    ip_locations.cityid, ip_locations.countryid" at the end.

    This SELECT statement returns no rows.

    Thanks,
    - Aaron


    On Thu, Jan 22, 2009 at 4:27 PM, Zheng Shao
    wrote:
    Can you do "describe extended a;" and "describe extended b;" and pste
    the result here?

    Zheng


    On 1/22/09, Aaron Kimball <aaron@cloudera.com >>
    wrote:
    Hi all,

    I have a perplexing problem with joins in Hive. I have two tables "A"
    and
    "B". A contains two columns x and y. B also contains x and y, and
    also z.
    I
    want to select the values of z from B, for every (x, y) I see in table
    A.
    So I wrote this statement:

    SELECT b.z FROM a JOIN b ON (a.x = b.x AND a.y = b.y);

    This crunches away and then comes back with an empty resultset. I have
    exported the same data into MySQL and ran the same query; it works
    just
    fine
    there. Does Hive support multi-column join criteria?

    Thanks,
    - Aaron Kimball
    --
    Sent from Gmail for mobile | mobile.google.com <
    http://mobile.google.com>
    <http://mobile.google.com>

    Yours,
    Zheng




    --
    Yours,
    Zheng
  • Joydeep Sen Sarma at Jan 24, 2009 at 3:38 am
    Moral of the story - don't google around too much before writing code.

    -----Original Message-----
    From: Raghu Murthy
    Sent: Friday, January 23, 2009 4:01 PM
    To: hive-user@hadoop.apache.org
    Subject: Re: equijoin with multiple columns?

    We could add trim to hive load, but, unfortunately it has been patented by
    IBM!

    "Method of automatically removing leading and trailing space characters
    from data being entered into a database system"

    http://patft.uspto.gov/netacgi/nph-Parser?Sect1=PTO1&Sect2=HITOFF&d=PALL&p=1
    &u=%2Fnetahtml%2FPTO%2Fsrchnum.htm&r=1&f=G&l=50&s1=7,475,086.PN.&OS=PN/7,475
    ,086&RS=PN/7,475,086

    "Retrieving data from a database system without leading and trailing
    space characters"

    http://appft1.uspto.gov/netacgi/nph-Parser?Sect1=PTO1&Sect2=HITOFF&d=PG01&p=
    1&u=/netahtml/PTO/srchnum.html&r=1&f=G&l=50&s1="20070282820".PGNR.&OS=DN/200
    70282820&RS=DN/20070282820

    On 1/23/09 3:55 PM, "Aaron Kimball" wrote:

    I solved the bug. As it turns out, there was trailing whitespace in some of my
    input files. So Hive wasn't matching the column's contents correctly, since
    Hive doesn't do an implicit trim() (my handwritten code did this). And of
    course, when I looked at the input files with less to verify their contents,
    the whitespace was invisible :P

    - Aaron
    On Fri, Jan 23, 2009 at 3:48 AM, Ashish Thusoo wrote:
    Aaron

    can you send the plan for this query..

    explain extended <your query>

    This is very unexpected..

    Thanks,
    Ashish
    ________________________________________
    From: Aaron Kimball [aaron@cloudera.com]
    Sent: Friday, January 23, 2009 1:24 AM
    To: hive-user@hadoop.apache.org
    Subject: Re: equijoin with multiple columns?

    Not only can I verify that the rows are present, I broke down and coded the
    join in Java MapReduce by hand. Works.
    - Aaron

    On Thu, Jan 22, 2009 at 11:42 PM, Zheng Shao
    wrote:
    This is unexpected. We did try similar cases before (with multiple join
    keys).

    Can you verify by finding out the actual rows in each table that will produce
    a join result row?

    Zheng


    On Thu, Jan 22, 2009 at 6:09 PM, Aaron Kimball
    wrote:
    Hi Zheng,

    I was eliding some information to try to boil things down a bit -- but here
    goes.


    To expand the white lie into the fuller debug dump:
    table A ~ ak_filtered_ips
    table B ~ ip_locations

    It's actually three fields ('a', 'b', and 'c') that I'm joining on, not two.

    hive> describe extended ak_filtered_ips;
    OK
    a int
    b int
    c int
    Detailed Table Information:
    Table(tableName:ak_filtered_
    ips,dbName:default,owner:aaron,createTime:1232661581,lastAccessTime:0,retenti>>
    o
    n:0,sd:StorageDescriptor(cols:[FieldSchema(name:a,type:int,comment:null),
    FieldSchema(name:b,type:int,comment:null),
    FieldSchema(name:c,type:int,comment:null)],location:hdfs://(server name
    redacted):9000/user/hive/warehouse/ak_filtered_ips,inputFormat:org.apache.had>>
    o
    >>
    op.mapred.TextInputFormat,outputFormat:org.apache.hadoop.hive.ql.io.IgnoreKey>>
    T
    >>
    extOutputFormat,compressed:false,numBuckets:-1,serdeInfo:SerDeInfo(name:null,>>
    s
    >>
    erializationLib:org.apache.hadoop.hive.serde2.MetadataTypedColumnsetSerDe,par>>
    a
    >>
    meters:{serialization.format=1}),bucketCols:[],sortCols:[],parameters:{}),par>>
    t
    itionKeys:[],parameters:{})

    hive> describe extended ip_locations;
    OK
    a int
    b int
    c int
    cityid int
    countryid int
    Detailed Table Information:
    Table(tableName:ip_locations,dbName:default,owner:aaron,createTime:1232596116>>
    ,
    >>
    lastAccessTime:0,retention:0,sd:StorageDescriptor(cols:[FieldSchema(name:a,ty>>
    p
    e:int,comment:null), FieldSchema(name:b,type:int,comment:null),
    FieldSchema(name:c,type:int,comment:null),
    FieldSchema(name:cityid,type:int,comment:null),
    FieldSchema(name:countryid,type:int,comment:null)],location:hdfs://(server
    name
    redacted):9000/user/hive/warehouse/ip_locations,inputFormat:org.apache.hadoop>>
    .
    >>
    mapred.TextInputFormat,outputFormat:org.apache.hadoop.hive.ql.io.IgnoreKeyTex>>
    t
    >>
    OutputFormat,compressed:false,numBuckets:-1,serdeInfo:SerDeInfo(name:null,ser>>
    i
    >>
    alizationLib:org.apache.hadoop.hive.serde2.MetadataTypedColumnsetSerDe,parame>>
    t
    >>
    ers:{serialization.format=1}),bucketCols:[],sortCols:[],parameters:{}),partit>>
    i
    onKeys:[],parameters:{})

    I actually just loaded in a super-small test case involving A, B, x, y, and z
    as I initially described. And that statement *does* work.

    The actual statement I was running into problems with is:

    SELECT ip_locations.cityid, ip_locations.countryid FROM
    ak_filtered_ips JOIN ip_locations ON (ak_filtered_ips.a = ip_locations.a AND
    ak_filtered_ips.b = ip_locations.b AND ak_filtered_ips.c = ip_locations.c)

    I would eventually like to add a "count(1)" to the field list, and a "GROUP
    BY
    ip_locations.cityid, ip_locations.countryid" at the end.

    This SELECT statement returns no rows.

    Thanks,
    - Aaron


    On Thu, Jan 22, 2009 at 4:27 PM, Zheng Shao
    wrote:
    Can you do "describe extended a;" and "describe extended b;" and pste
    the result here?

    Zheng

    On 1/22/09, Aaron Kimball wrote:
    Hi all,

    I have a perplexing problem with joins in Hive. I have two tables "A" and
    "B". A contains two columns x and y. B also contains x and y, and also z.
    I
    want to select the values of z from B, for every (x, y) I see in table A.

    So I wrote this statement:

    SELECT b.z FROM a JOIN b ON (a.x = b.x AND a.y = b.y);

    This crunches away and then comes back with an empty resultset. I have
    exported the same data into MySQL and ran the same query; it works just fine
    there. Does Hive support multi-column join criteria?

    Thanks,
    - Aaron Kimball
    --
    Sent from Gmail for mobile | mobile.google.com <http://mobile.google.com>
    <http://mobile.google.com>

    Yours,
    Zheng




    --
    Yours,
    Zheng
  • Aaron Kimball at Jan 24, 2009 at 5:25 am
    :(
    On Fri, Jan 23, 2009 at 7:38 PM, Joydeep Sen Sarma wrote:

    Moral of the story - don't google around too much before writing code.

    -----Original Message-----
    From: Raghu Murthy
    Sent: Friday, January 23, 2009 4:01 PM
    To: hive-user@hadoop.apache.org
    Subject: Re: equijoin with multiple columns?

    We could add trim to hive load, but, unfortunately it has been patented by
    IBM!

    "Method of automatically removing leading and trailing space characters
    from data being entered into a database system"


    http://patft.uspto.gov/netacgi/nph-Parser?Sect1=PTO1&Sect2=HITOFF&d=PALL&p=1

    &u=%2Fnetahtml%2FPTO%2Fsrchnum.htm&r=1&f=G&l=50&s1=7,475,086.PN.&OS=PN/7,475
    ,086&RS=PN/7,475,086<http://patft.uspto.gov/netacgi/nph-Parser?Sect1=PTO1&Sect2=HITOFF&d=PALL&p=1&u=%2Fnetahtml%2FPTO%2Fsrchnum.htm&r=1&f=G&l=50&s1=7,475,086.PN.&OS=PN/7,475,086&RS=PN/7,475,086>

    "Retrieving data from a database system without leading and trailing
    space characters"


    http://appft1.uspto.gov/netacgi/nph-Parser?Sect1=PTO1&Sect2=HITOFF&d=PG01&p=
    1&u=/netahtml/PTO/srchnum.html&r=1&f=G&l=50&s1=<http://appft1.uspto.gov/netacgi/nph-Parser?Sect1=PTO1&Sect2=HITOFF&d=PG01&p=1&u=/netahtml/PTO/srchnum.html&r=1&f=G&l=50&s1=>
    "20070282820".PGNR.&OS=DN/200
    70282820&RS=DN/20070282820

    On 1/23/09 3:55 PM, "Aaron Kimball" wrote:

    I solved the bug. As it turns out, there was trailing whitespace in some of my
    input files. So Hive wasn't matching the column's contents correctly, since
    Hive doesn't do an implicit trim() (my handwritten code did this). And of
    course, when I looked at the input files with less to verify their contents,
    the whitespace was invisible :P

    - Aaron
    On Fri, Jan 23, 2009 at 3:48 AM, Ashish Thusoo wrote:
    Aaron

    can you send the plan for this query..

    explain extended <your query>

    This is very unexpected..

    Thanks,
    Ashish
    ________________________________________
    From: Aaron Kimball [aaron@cloudera.com]
    Sent: Friday, January 23, 2009 1:24 AM
    To: hive-user@hadoop.apache.org
    Subject: Re: equijoin with multiple columns?

    Not only can I verify that the rows are present, I broke down and coded
    the
    join in Java MapReduce by hand. Works.
    - Aaron

    On Thu, Jan 22, 2009 at 11:42 PM, Zheng Shao
    wrote:
    This is unexpected. We did try similar cases before (with multiple join
    keys).

    Can you verify by finding out the actual rows in each table that will
    produce
    a join result row?

    Zheng


    On Thu, Jan 22, 2009 at 6:09 PM, Aaron Kimball
    wrote:
    Hi Zheng,

    I was eliding some information to try to boil things down a bit -- but
    here
    goes.


    To expand the white lie into the fuller debug dump:
    table A ~ ak_filtered_ips
    table B ~ ip_locations

    It's actually three fields ('a', 'b', and 'c') that I'm joining on, not
    two.
    hive> describe extended ak_filtered_ips;
    OK
    a int
    b int
    c int
    Detailed Table Information:
    Table(tableName:ak_filtered_
    ips,dbName:default,owner:aaron,createTime:1232661581,lastAccessTime:0,retenti>>
    o
    n:0,sd:StorageDescriptor(cols:[FieldSchema(name:a,type:int,comment:null),
    FieldSchema(name:b,type:int,comment:null),
    FieldSchema(name:c,type:int,comment:null)],location:hdfs://(server name
    redacted):9000/user/hive/warehouse/ak_filtered_ips,inputFormat:org.apache.had>>
    o
    op.mapred.TextInputFormat,outputFormat:org.apache.hadoop.hive.ql.io.IgnoreKey>>
    T
    extOutputFormat,compressed:false,numBuckets:-1,serdeInfo:SerDeInfo(name:null,>>
    s
    erializationLib:org.apache.hadoop.hive.serde2.MetadataTypedColumnsetSerDe,par>>
    a
    meters:{serialization.format=1}),bucketCols:[],sortCols:[],parameters:{}),par>>
    t
    itionKeys:[],parameters:{})

    hive> describe extended ip_locations;
    OK
    a int
    b int
    c int
    cityid int
    countryid int
    Detailed Table Information:
    Table(tableName:ip_locations,dbName:default,owner:aaron,createTime:1232596116>>
    ,
    lastAccessTime:0,retention:0,sd:StorageDescriptor(cols:[FieldSchema(name:a,ty>>
    p
    e:int,comment:null), FieldSchema(name:b,type:int,comment:null),
    FieldSchema(name:c,type:int,comment:null),
    FieldSchema(name:cityid,type:int,comment:null),
    FieldSchema(name:countryid,type:int,comment:null)],location:hdfs://(server
    name
    redacted):9000/user/hive/warehouse/ip_locations,inputFormat:org.apache.hadoop>>
    .
    mapred.TextInputFormat,outputFormat:org.apache.hadoop.hive.ql.io.IgnoreKeyTex>>
    t
    OutputFormat,compressed:false,numBuckets:-1,serdeInfo:SerDeInfo(name:null,ser>>
    i
    alizationLib:org.apache.hadoop.hive.serde2.MetadataTypedColumnsetSerDe,parame>>
    t
    ers:{serialization.format=1}),bucketCols:[],sortCols:[],parameters:{}),partit>>
    i
    onKeys:[],parameters:{})

    I actually just loaded in a super-small test case involving A, B, x, y,
    and z
    as I initially described. And that statement *does* work.

    The actual statement I was running into problems with is:

    SELECT ip_locations.cityid, ip_locations.countryid FROM
    ak_filtered_ips JOIN ip_locations ON (ak_filtered_ips.a = ip_locations.a
    AND
    ak_filtered_ips.b = ip_locations.b AND ak_filtered_ips.c =
    ip_locations.c)
    I would eventually like to add a "count(1)" to the field list, and a
    "GROUP
    BY
    ip_locations.cityid, ip_locations.countryid" at the end.

    This SELECT statement returns no rows.

    Thanks,
    - Aaron


    On Thu, Jan 22, 2009 at 4:27 PM, Zheng Shao
    wrote:
    Can you do "describe extended a;" and "describe extended b;" and pste
    the result here?

    Zheng


    On 1/22/09, Aaron Kimball <aaron@cloudera.com >>
    wrote:
    Hi all,

    I have a perplexing problem with joins in Hive. I have two tables "A"
    and
    "B". A contains two columns x and y. B also contains x and y, and
    also z.
    I
    want to select the values of z from B, for every (x, y) I see in table
    A.
    So I wrote this statement:

    SELECT b.z FROM a JOIN b ON (a.x = b.x AND a.y = b.y);

    This crunches away and then comes back with an empty resultset. I have
    exported the same data into MySQL and ran the same query; it works
    just
    fine
    there. Does Hive support multi-column join criteria?

    Thanks,
    - Aaron Kimball
    --
    Sent from Gmail for mobile | mobile.google.com <
    http://mobile.google.com>
    <http://mobile.google.com>

    Yours,
    Zheng




    --
    Yours,
    Zheng
  • Edward Capriolo at Jan 24, 2009 at 6:13 am
    Actually, an implicit trim() is probably a bad idea. It is more of a
    gimmick. Would you turn the trim on for the table or only for certain
    rows? Maybe a patent should be made for auto-rounding, auto-spel
    lcheck, or auto-square root. Where does it end?

    In any case an idea might be to add some syntax to the select
    statement to show whitespace as an octal number or space. When
    selecting the row the write space would be visible.
  • Ashish Thusoo at Jan 24, 2009 at 4:22 pm
    yeah we could do that. We could also just make the field separator that is displayed by the separator configurable...

    Regarding the patent.. in this day and age people patent the most trivial of things as well.. we could also perhaps make this more explicit by allowing simple transformations in the load statement, similar to what mysql does using set expressions in the syntax...

    http://dev.mysql.com/doc/refman/5.1/en/load-data.html

    Ashish

    ________________________________________
    From: Edward Capriolo [edlinuxguru@gmail.com]
    Sent: Friday, January 23, 2009 10:12 PM
    To: hive-user@hadoop.apache.org
    Subject: Re: equijoin with multiple columns?

    Actually, an implicit trim() is probably a bad idea. It is more of a
    gimmick. Would you turn the trim on for the table or only for certain
    rows? Maybe a patent should be made for auto-rounding, auto-spel
    lcheck, or auto-square root. Where does it end?

    In any case an idea might be to add some syntax to the select
    statement to show whitespace as an octal number or space. When
    selecting the row the write space would be visible.
  • Namit Jain at Jan 27, 2009 at 2:55 am
    Can you send the data for the tables ? I tried the following and it works fine:



    CREATE TABLE tmp1(key1 INT, value1 STRING, key2 INT, value2 STRING) STORED AS TEXTFILE;

    FROM src src1 INSERT OVERWRITE TABLE tmp1 SELECT src1.key, src1.value, src1.key, src1.value;

    SELECT * from tmp1;

    CREATE TABLE tmp2(key1 INT, value1 STRING, key2 INT, value2 STRING) STORED AS TEXTFILE;

    FROM src src1 INSERT OVERWRITE TABLE tmp2 SELECT src1.key, src1.value, src1.key, src1.value;

    SELECT * from tmp2;

    SELECT tmp1.* FROM tmp1 JOIN tmp2 ON (tmp1.key1 = tmp2.key1 AND tmp1.key2 = tmp2.key2 AND tmp1.value1\
    = tmp2.value1);

    DROP TABLE tmp1;
    DROP TABLE tmp2;



    Thanks,
    -namit


    ________________________________________
    From: Aaron Kimball [aaron@cloudera.com]
    Sent: Friday, January 23, 2009 1:24 AM
    To: hive-user@hadoop.apache.org
    Subject: Re: equijoin with multiple columns?

    Not only can I verify that the rows are present, I broke down and coded the join in Java MapReduce by hand. Works.
    - Aaron

    On Thu, Jan 22, 2009 at 11:42 PM, Zheng Shao wrote:
    This is unexpected. We did try similar cases before (with multiple join keys).

    Can you verify by finding out the actual rows in each table that will produce a join result row?

    Zheng


    On Thu, Jan 22, 2009 at 6:09 PM, Aaron Kimball wrote:
    Hi Zheng,

    I was eliding some information to try to boil things down a bit -- but here goes.


    To expand the white lie into the fuller debug dump:
    table A ~ ak_filtered_ips
    table B ~ ip_locations

    It's actually three fields ('a', 'b', and 'c') that I'm joining on, not two.

    hive> describe extended ak_filtered_ips;
    OK
    a int
    b int
    c int
    Detailed Table Information:
    Table(tableName:ak_filtered_
    ips,dbName:default,owner:aaron,createTime:1232661581,lastAccessTime:0,retention:0,sd:StorageDescriptor(cols:[FieldSchema(name:a,type:int,comment:null), FieldSchema(name:b,type:int,comment:null), FieldSchema(name:c,type:int,comment:null)],location:hdfs://(server name redacted):9000/user/hive/warehouse/ak_filtered_ips,inputFormat:org.apache.hadoop.mapred.TextInputFormat,outputFormat:org.apache.hadoop.hive.ql.io.IgnoreKeyTextOutputFormat,compressed:false,numBuckets:-1,serdeInfo:SerDeInfo(name:null,serializationLib:org.apache.hadoop.hive.serde2.MetadataTypedColumnsetSerDe,parameters:{serialization.format=1}),bucketCols:[],sortCols:[],parameters:{}),partitionKeys:[],parameters:{})

    hive> describe extended ip_locations;
    OK
    a int
    b int
    c int
    cityid int
    countryid int
    Detailed Table Information:
    Table(tableName:ip_locations,dbName:default,owner:aaron,createTime:1232596116,lastAccessTime:0,retention:0,sd:StorageDescriptor(cols:[FieldSchema(name:a,type:int,comment:null), FieldSchema(name:b,type:int,comment:null), FieldSchema(name:c,type:int,comment:null), FieldSchema(name:cityid,type:int,comment:null), FieldSchema(name:countryid,type:int,comment:null)],location:hdfs://(server name redacted):9000/user/hive/warehouse/ip_locations,inputFormat:org.apache.hadoop.mapred.TextInputFormat,outputFormat:org.apache.hadoop.hive.ql.io.IgnoreKeyTextOutputFormat,compressed:false,numBuckets:-1,serdeInfo:SerDeInfo(name:null,serializationLib:org.apache.hadoop.hive.serde2.MetadataTypedColumnsetSerDe,parameters:{serialization.format=1}),bucketCols:[],sortCols:[],parameters:{}),partitionKeys:[],parameters:{})

    I actually just loaded in a super-small test case involving A, B, x, y, and z as I initially described. And that statement *does* work.

    The actual statement I was running into problems with is:

    SELECT ip_locations.cityid, ip_locations.countryid FROM
    ak_filtered_ips JOIN ip_locations ON (ak_filtered_ips.a = ip_locations.a AND ak_filtered_ips.b = ip_locations.b AND ak_filtered_ips.c = ip_locations.c)

    I would eventually like to add a "count(1)" to the field list, and a "GROUP BY ip_locations.cityid, ip_locations.countryid" at the end.

    This SELECT statement returns no rows.

    Thanks,
    - Aaron


    On Thu, Jan 22, 2009 at 4:27 PM, Zheng Shao wrote:
    Can you do "describe extended a;" and "describe extended b;" and pste
    the result here?

    Zheng

    On 1/22/09, Aaron Kimball wrote:
    Hi all,

    I have a perplexing problem with joins in Hive. I have two tables "A" and
    "B". A contains two columns x and y. B also contains x and y, and also z. I
    want to select the values of z from B, for every (x, y) I see in table A.

    So I wrote this statement:

    SELECT b.z FROM a JOIN b ON (a.x = b.x AND a.y = b.y);

    This crunches away and then comes back with an empty resultset. I have
    exported the same data into MySQL and ran the same query; it works just fine
    there. Does Hive support multi-column join criteria?

    Thanks,
    - Aaron Kimball
    --
    Sent from Gmail for mobile | mobile.google.com<http://mobile.google.com>

    Yours,
    Zheng




    --
    Yours,
    Zheng
  • Jeff Hammerbacher at Jan 27, 2009 at 3:00 am
    Hey Namit,

    We found the bug, reported a few messages up in the thread:

    """
    I solved the bug. As it turns out, there was trailing whitespace in some of
    my input files. So Hive wasn't matching the column's contents correctly,
    since Hive doesn't do an implicit trim() (my handwritten code did this). And
    of course, when I looked at the input files with less to verify their
    contents, the whitespace was invisible :P
    """

    Thanks,
    Jeff
    On Mon, Jan 26, 2009 at 6:54 PM, Namit Jain wrote:

    Can you send the data for the tables ? I tried the following and it works
    fine:



    CREATE TABLE tmp1(key1 INT, value1 STRING, key2 INT, value2 STRING) STORED
    AS TEXTFILE;

    FROM src src1 INSERT OVERWRITE TABLE tmp1 SELECT src1.key, src1.value,
    src1.key, src1.value;

    SELECT * from tmp1;

    CREATE TABLE tmp2(key1 INT, value1 STRING, key2 INT, value2 STRING) STORED
    AS TEXTFILE;

    FROM src src1 INSERT OVERWRITE TABLE tmp2 SELECT src1.key, src1.value,
    src1.key, src1.value;

    SELECT * from tmp2;

    SELECT tmp1.* FROM tmp1 JOIN tmp2 ON (tmp1.key1 = tmp2.key1 AND tmp1.key2 =
    tmp2.key2 AND tmp1.value1\
    = tmp2.value1);

    DROP TABLE tmp1;
    DROP TABLE tmp2;



    Thanks,
    -namit


    ________________________________________
    From: Aaron Kimball [aaron@cloudera.com]
    Sent: Friday, January 23, 2009 1:24 AM
    To: hive-user@hadoop.apache.org
    Subject: Re: equijoin with multiple columns?

    Not only can I verify that the rows are present, I broke down and coded the
    join in Java MapReduce by hand. Works.
    - Aaron

    On Thu, Jan 22, 2009 at 11:42 PM, Zheng Shao <zshao9@gmail.com<mailto:
    zshao9@gmail.com>> wrote:
    This is unexpected. We did try similar cases before (with multiple join
    keys).

    Can you verify by finding out the actual rows in each table that will
    produce a join result row?

    Zheng


    On Thu, Jan 22, 2009 at 6:09 PM, Aaron Kimball <aaron@cloudera.com<mailto:
    aaron@cloudera.com>> wrote:
    Hi Zheng,

    I was eliding some information to try to boil things down a bit -- but here
    goes.


    To expand the white lie into the fuller debug dump:
    table A ~ ak_filtered_ips
    table B ~ ip_locations

    It's actually three fields ('a', 'b', and 'c') that I'm joining on, not
    two.

    hive> describe extended ak_filtered_ips;
    OK
    a int
    b int
    c int
    Detailed Table Information:
    Table(tableName:ak_filtered_
    ips,dbName:default,owner:aaron,createTime:1232661581,lastAccessTime:0,retention:0,sd:StorageDescriptor(cols:[FieldSchema(name:a,type:int,comment:null),
    FieldSchema(name:b,type:int,comment:null),
    FieldSchema(name:c,type:int,comment:null)],location:hdfs://(server name
    redacted):9000/user/hive/warehouse/ak_filtered_ips,inputFormat:org.apache.hadoop.mapred.TextInputFormat,outputFormat:org.apache.hadoop.hive.ql.io.IgnoreKeyTextOutputFormat,compressed:false,numBuckets:-1,serdeInfo:SerDeInfo(name:null,serializationLib:org.apache.hadoop.hive.serde2.MetadataTypedColumnsetSerDe,parameters:{serialization.format=1}),bucketCols:[],sortCols:[],parameters:{}),partitionKeys:[],parameters:{})

    hive> describe extended ip_locations;
    OK
    a int
    b int
    c int
    cityid int
    countryid int
    Detailed Table Information:
    Table(tableName:ip_locations,dbName:default,owner:aaron,createTime:1232596116,lastAccessTime:0,retention:0,sd:StorageDescriptor(cols:[FieldSchema(name:a,type:int,comment:null),
    FieldSchema(name:b,type:int,comment:null),
    FieldSchema(name:c,type:int,comment:null),
    FieldSchema(name:cityid,type:int,comment:null),
    FieldSchema(name:countryid,type:int,comment:null)],location:hdfs://(server
    name
    redacted):9000/user/hive/warehouse/ip_locations,inputFormat:org.apache.hadoop.mapred.TextInputFormat,outputFormat:org.apache.hadoop.hive.ql.io.IgnoreKeyTextOutputFormat,compressed:false,numBuckets:-1,serdeInfo:SerDeInfo(name:null,serializationLib:org.apache.hadoop.hive.serde2.MetadataTypedColumnsetSerDe,parameters:{serialization.format=1}),bucketCols:[],sortCols:[],parameters:{}),partitionKeys:[],parameters:{})

    I actually just loaded in a super-small test case involving A, B, x, y, and
    z as I initially described. And that statement *does* work.

    The actual statement I was running into problems with is:

    SELECT ip_locations.cityid, ip_locations.countryid FROM
    ak_filtered_ips JOIN ip_locations ON (ak_filtered_ips.a = ip_locations.a
    AND ak_filtered_ips.b = ip_locations.b AND ak_filtered_ips.c =
    ip_locations.c)

    I would eventually like to add a "count(1)" to the field list, and a "GROUP
    BY ip_locations.cityid, ip_locations.countryid" at the end.

    This SELECT statement returns no rows.

    Thanks,
    - Aaron


    On Thu, Jan 22, 2009 at 4:27 PM, Zheng Shao <zshao9@gmail.com<mailto:
    zshao9@gmail.com>> wrote:
    Can you do "describe extended a;" and "describe extended b;" and pste
    the result here?

    Zheng

    On 1/22/09, Aaron Kimball wrote:
    Hi all,

    I have a perplexing problem with joins in Hive. I have two tables "A" and
    "B". A contains two columns x and y. B also contains x and y, and also z. I
    want to select the values of z from B, for every (x, y) I see in table A.

    So I wrote this statement:

    SELECT b.z FROM a JOIN b ON (a.x = b.x AND a.y = b.y);

    This crunches away and then comes back with an empty resultset. I have
    exported the same data into MySQL and ran the same query; it works just fine
    there. Does Hive support multi-column join criteria?

    Thanks,
    - Aaron Kimball
    --
    Sent from Gmail for mobile | mobile.google.com<http://mobile.google.com>

    Yours,
    Zheng




    --
    Yours,
    Zheng

Related Discussions

Discussion Navigation
viewthread | post
Discussion Overview
groupuser @
categorieshive, hadoop
postedJan 23, '09 at 4:57a
activeJan 27, '09 at 3:00a
posts15
users8
websitehive.apache.org

People

Translate

site design / logo © 2021 Grokbase