Grokbase Groups Hive user March 2010
Hadoop: 0.20.1+152 (cloudera)
Hive: 0.4.2

*Query (simplified for discussion):*
FROM (SELECT c1, c2, day FROM tbl1
      WHERE day >= '2010-02-01' AND day <= '2010-02-20') base_qry
INSERT OVERWRITE TABLE tbl2 SELECT base_qry.* WHERE day = '2010-02-01'
INSERT OVERWRITE TABLE tbl2 SELECT base_qry.* WHERE day = '2010-02-02'
...
INSERT OVERWRITE TABLE tbl2 SELECT base_qry.* WHERE day = '2010-02-20'

Goal: run a Hive query on day N (execution day 2010-02-21) over "tbl1"
partitions for days N-20 through N-1, and refresh/overwrite the corresponding
"day" partitions of "tbl2".

*Some Facts:*
- Each "day" partition of "tbl1" has 24 files.
- The last Hadoop job of base_qry is a map-only job (and we want it to stay a
map-only job for efficiency reasons).

The query execution resulted in 480 files (24 * 20) for EACH "day" partition
of "tbl2": 24 non-empty files and 456 empty ones. Each of the 480 mappers
(24 files x 20 input partitions) writes one file per INSERT target, but only
the 24 mappers reading that day's input produce non-empty output for it.

Now a user running a query like "SELECT blah FROM tbl2 WHERE day >= '2010-02-01'
AND day <= '2010-02-20'" suddenly gets 480 * 20 = 9,600 mappers (expected
480), and it gets worse with a bigger date range.

As a quick fix I am running a FOR loop issuing one query per day's partition
(or I could delete the empty files as a post-processing step). Any other
suggestions?
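For the post-processing option, a minimal sketch of deleting the zero-byte part files, shown against a local copy of one partition directory (the path is a placeholder, not my real warehouse layout). On HDFS the equivalent filter would be something like `hadoop fs -ls <dir> | awk '$5 == 0 {print $8}' | xargs hadoop fs -rm`:

```shell
#!/bin/sh
# PART_DIR is a hypothetical local path standing in for one day=... partition
# directory of tbl2; override it when running the script.
PART_DIR=${PART_DIR:-/tmp/tbl2/day=2010-02-01}

# Remove the zero-byte part files left behind by the multi-insert map-only job;
# non-empty files are untouched.
find "$PART_DIR" -type f -size 0 -delete
```

This keeps the downstream mapper count proportional to the real data files instead of one mapper per empty file.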

- I believe Hive 0.5 handles small/empty files better, so perhaps this is
less of an issue there?
- If we can fix this behavior, I will be happy to spend some time on it.
Any inputs/guidance?
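On the Hive 0.5 point, a sketch of the settings I believe are relevant (names taken from the Hive configuration docs; availability and defaults should be checked against the exact version):

```sql
-- Merge the small output files of a map-only job into larger files.
-- Note: this launches an extra merge job, which may conflict with the
-- "keep it map-only" goal above.
SET hive.merge.mapfiles=true;
SET hive.merge.size.per.task=256000000;

-- On the read side (newer Hive versions), combine many small files into
-- fewer input splits so queries over tbl2 don't get one mapper per file.
SET hive.input.format=org.apache.hadoop.hive.ql.io.CombineHiveInputFormat;
```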


posted: Mar 15, '10 at 9:22p
active: Mar 15, '10 at 9:22p

1 user in discussion

Abhijit Pol: 1 post
