FAQ
I solved the bug. As it turns out, there was trailing whitespace in some of
my input files. So Hive wasn't matching the column's contents correctly,
since Hive doesn't do an implicit trim() (my handwritten code did this). And
of course, when I looked at the input files with less to verify their
contents, the whitespace was invisible :P

- Aaron
On Fri, Jan 23, 2009 at 3:48 AM, Ashish Thusoo wrote:

Aaron

can you send the plan for this query..

explain extended <your query>

This is very unexpected..

Thanks,
Ashish
________________________________________
From: Aaron Kimball [aaron@cloudera.com]
Sent: Friday, January 23, 2009 1:24 AM
To: hive-user@hadoop.apache.org
Subject: Re: equijoin with multiple columns?

Not only can I verify that the rows are present, I broke down and coded the
join in Java MapReduce by hand. Works.
- Aaron

On Thu, Jan 22, 2009 at 11:42 PM, Zheng Shao <zshao9@gmail.com<mailto:
zshao9@gmail.com>> wrote:
This is unexpected. We did try similar cases before (with multiple join
keys).

Can you verify by finding out the actual rows in each table that will
produce a join result row?

Zheng


On Thu, Jan 22, 2009 at 6:09 PM, Aaron Kimball <aaron@cloudera.com<mailto:
aaron@cloudera.com>> wrote:
Hi Zheng,

I was eliding some information to try to boil things down a bit -- but here
goes.


To expand the white lie into the fuller debug dump:
table A ~ ak_filtered_ips
table B ~ ip_locations

It's actually three fields ('a', 'b', and 'c') that I'm joining on, not
two.

hive> describe extended ak_filtered_ips;
OK
a int
b int
c int
Detailed Table Information:
Table(tableName:ak_filtered_
ips,dbName:default,owner:aaron,createTime:1232661581,lastAccessTime:0,retention:0,sd:StorageDescriptor(cols:[FieldSchema(name:a,type:int,comment:null),
FieldSchema(name:b,type:int,comment:null),
FieldSchema(name:c,type:int,comment:null)],location:hdfs://(server name
redacted):9000/user/hive/warehouse/ak_filtered_ips,inputFormat:org.apache.hadoop.mapred.TextInputFormat,outputFormat:org.apache.hadoop.hive.ql.io.IgnoreKeyTextOutputFormat,compressed:false,numBuckets:-1,serdeInfo:SerDeInfo(name:null,serializationLib:org.apache.hadoop.hive.serde2.MetadataTypedColumnsetSerDe,parameters:{serialization.format=1}),bucketCols:[],sortCols:[],parameters:{}),partitionKeys:[],parameters:{})

hive> describe extended ip_locations;
OK
a int
b int
c int
cityid int
countryid int
Detailed Table Information:
Table(tableName:ip_locations,dbName:default,owner:aaron,createTime:1232596116,lastAccessTime:0,retention:0,sd:StorageDescriptor(cols:[FieldSchema(name:a,type:int,comment:null),
FieldSchema(name:b,type:int,comment:null),
FieldSchema(name:c,type:int,comment:null),
FieldSchema(name:cityid,type:int,comment:null),
FieldSchema(name:countryid,type:int,comment:null)],location:hdfs://(server
name
redacted):9000/user/hive/warehouse/ip_locations,inputFormat:org.apache.hadoop.mapred.TextInputFormat,outputFormat:org.apache.hadoop.hive.ql.io.IgnoreKeyTextOutputFormat,compressed:false,numBuckets:-1,serdeInfo:SerDeInfo(name:null,serializationLib:org.apache.hadoop.hive.serde2.MetadataTypedColumnsetSerDe,parameters:{serialization.format=1}),bucketCols:[],sortCols:[],parameters:{}),partitionKeys:[],parameters:{})

I actually just loaded in a super-small test case involving A, B, x, y, and
z as I initially described. And that statement *does* work.

The actual statement I was running into problems with is:

SELECT ip_locations.cityid, ip_locations.countryid FROM
ak_filtered_ips JOIN ip_locations ON (ak_filtered_ips.a = ip_locations.a
AND ak_filtered_ips.b = ip_locations.b AND ak_filtered_ips.c =
ip_locations.c)

I would eventually like to add a "count(1)" to the field list, and a "GROUP
BY ip_locations.cityid, ip_locations.countryid" at the end.

This SELECT statement returns no rows.

Thanks,
- Aaron


On Thu, Jan 22, 2009 at 4:27 PM, Zheng Shao <zshao9@gmail.com<mailto:
zshao9@gmail.com>> wrote:
Can you do "describe extended a;" and "describe extended b;" and pste
the result here?

Zheng

On 1/22/09, Aaron Kimball wrote:
Hi all,

I have a perplexing problem with joins in Hive. I have two tables "A" and
"B". A contains two columns x and y. B also contains x and y, and also z. I
want to select the values of z from B, for every (x, y) I see in table A.

So I wrote this statement:

SELECT b.z FROM a JOIN b ON (a.x = b.x AND a.y = b.y);

This crunches away and then comes back with an empty resultset. I have
exported the same data into MySQL and ran the same query; it works just fine
there. Does Hive support multi-column join criteria?

Thanks,
- Aaron Kimball
--
Sent from Gmail for mobile | mobile.google.com<http://mobile.google.com>

Yours,
Zheng




--
Yours,
Zheng

Search Discussions

Discussion Posts

Previous

Follow ups

Related Discussions

Discussion Navigation
viewthread | post
posts ‹ prev | 7 of 15 | next ›
Discussion Overview
groupuser @
categorieshive, hadoop
postedJan 23, '09 at 4:57a
activeJan 27, '09 at 3:00a
posts15
users8
websitehive.apache.org

People

Translate

site design / logo © 2021 Grokbase