FAQ
We could add trim to hive load, but, unfortunately it has been patented by
IBM!

"Method of automatically removing leading and trailing space characters
from data being entered into a database system"

http://patft.uspto.gov/netacgi/nph-Parser?Sect1=PTO1&Sect2=HITOFF&d=PALL&p=1
&u=%2Fnetahtml%2FPTO%2Fsrchnum.htm&r=1&f=G&l=50&s1=7,475,086.PN.&OS=PN/7,475
,086&RS=PN/7,475,086

"Retrieving data from a database system without leading and trailing
space characters"

http://appft1.uspto.gov/netacgi/nph-Parser?Sect1=PTO1&Sect2=HITOFF&d=PG01&p=
1&u=/netahtml/PTO/srchnum.html&r=1&f=G&l=50&s1="20070282820".PGNR.&OS=DN/200
70282820&RS=DN/20070282820

On 1/23/09 3:55 PM, "Aaron Kimball" wrote:

I solved the bug. As it turns out, there was trailing whitespace in some of my
input files. So Hive wasn't matching the column's contents correctly, since
Hive doesn't do an implicit trim() (my handwritten code did this). And of
course, when I looked at the input files with less to verify their contents,
the whitespace was invisible :P

- Aaron
On Fri, Jan 23, 2009 at 3:48 AM, Ashish Thusoo wrote:
Aaron

can you send the plan for this query..

explain extended <your query>

This is very unexpected..

Thanks,
Ashish
________________________________________
From: Aaron Kimball [aaron@cloudera.com]
Sent: Friday, January 23, 2009 1:24 AM
To: hive-user@hadoop.apache.org
Subject: Re: equijoin with multiple columns?

Not only can I verify that the rows are present, I broke down and coded the
join in Java MapReduce by hand. Works.
- Aaron

On Thu, Jan 22, 2009 at 11:42 PM, Zheng Shao
wrote:
This is unexpected. We did try similar cases before (with multiple join
keys).

Can you verify by finding out the actual rows in each table that will produce
a join result row?

Zheng


On Thu, Jan 22, 2009 at 6:09 PM, Aaron Kimball
wrote:
Hi Zheng,

I was eliding some information to try to boil things down a bit -- but here
goes.


To expand the white lie into the fuller debug dump:
table A ~ ak_filtered_ips
table B ~ ip_locations

It's actually three fields ('a', 'b', and 'c') that I'm joining on, not two.

hive> describe extended ak_filtered_ips;
OK
a int
b int
c int
Detailed Table Information:
Table(tableName:ak_filtered_
ips,dbName:default,owner:aaron,createTime:1232661581,lastAccessTime:0,retenti>>
o
n:0,sd:StorageDescriptor(cols:[FieldSchema(name:a,type:int,comment:null),
FieldSchema(name:b,type:int,comment:null),
FieldSchema(name:c,type:int,comment:null)],location:hdfs://(server name
redacted):9000/user/hive/warehouse/ak_filtered_ips,inputFormat:org.apache.had>>
o
>>
op.mapred.TextInputFormat,outputFormat:org.apache.hadoop.hive.ql.io.IgnoreKey>>
T
>>
extOutputFormat,compressed:false,numBuckets:-1,serdeInfo:SerDeInfo(name:null,>>
s
>>
erializationLib:org.apache.hadoop.hive.serde2.MetadataTypedColumnsetSerDe,par>>
a
>>
meters:{serialization.format=1}),bucketCols:[],sortCols:[],parameters:{}),par>>
t
itionKeys:[],parameters:{})

hive> describe extended ip_locations;
OK
a int
b int
c int
cityid int
countryid int
Detailed Table Information:
Table(tableName:ip_locations,dbName:default,owner:aaron,createTime:1232596116>>
,
>>
lastAccessTime:0,retention:0,sd:StorageDescriptor(cols:[FieldSchema(name:a,ty>>
p
e:int,comment:null), FieldSchema(name:b,type:int,comment:null),
FieldSchema(name:c,type:int,comment:null),
FieldSchema(name:cityid,type:int,comment:null),
FieldSchema(name:countryid,type:int,comment:null)],location:hdfs://(server
name
redacted):9000/user/hive/warehouse/ip_locations,inputFormat:org.apache.hadoop>>
.
>>
mapred.TextInputFormat,outputFormat:org.apache.hadoop.hive.ql.io.IgnoreKeyTex>>
t
>>
OutputFormat,compressed:false,numBuckets:-1,serdeInfo:SerDeInfo(name:null,ser>>
i
>>
alizationLib:org.apache.hadoop.hive.serde2.MetadataTypedColumnsetSerDe,parame>>
t
>>
ers:{serialization.format=1}),bucketCols:[],sortCols:[],parameters:{}),partit>>
i
onKeys:[],parameters:{})

I actually just loaded in a super-small test case involving A, B, x, y, and z
as I initially described. And that statement *does* work.

The actual statement I was running into problems with is:

SELECT ip_locations.cityid, ip_locations.countryid FROM
ak_filtered_ips JOIN ip_locations ON (ak_filtered_ips.a = ip_locations.a AND
ak_filtered_ips.b = ip_locations.b AND ak_filtered_ips.c = ip_locations.c)

I would eventually like to add a "count(1)" to the field list, and a "GROUP
BY
ip_locations.cityid, ip_locations.countryid" at the end.

This SELECT statement returns no rows.

Thanks,
- Aaron


On Thu, Jan 22, 2009 at 4:27 PM, Zheng Shao
wrote:
Can you do "describe extended a;" and "describe extended b;" and pste
the result here?

Zheng

On 1/22/09, Aaron Kimball wrote:
Hi all,

I have a perplexing problem with joins in Hive. I have two tables "A" and
"B". A contains two columns x and y. B also contains x and y, and also z.
I
want to select the values of z from B, for every (x, y) I see in table A.

So I wrote this statement:

SELECT b.z FROM a JOIN b ON (a.x = b.x AND a.y = b.y);

This crunches away and then comes back with an empty resultset. I have
exported the same data into MySQL and ran the same query; it works just fine
there. Does Hive support multi-column join criteria?

Thanks,
- Aaron Kimball
--
Sent from Gmail for mobile | mobile.google.com <http://mobile.google.com>
<http://mobile.google.com>

Yours,
Zheng




--
Yours,
Zheng

Search Discussions

Discussion Posts

Previous

Follow ups

Related Discussions

Discussion Navigation
viewthread | post
posts ‹ prev | 8 of 15 | next ›
Discussion Overview
groupuser @
categorieshive, hadoop
postedJan 23, '09 at 4:57a
activeJan 27, '09 at 3:00a
posts15
users8
websitehive.apache.org

People

Translate

site design / logo © 2021 Grokbase