Hadoop: 0.20.1+152 (cloudera)
*Query (simplified for discussion):*
FROM (
  SELECT c1, c2, day FROM tbl1
  WHERE day >= '2010-02-01' AND day <= '2010-02-20'
) base_qry
INSERT OVERWRITE TABLE tbl2 PARTITION (day='2010-02-01')
  SELECT base_qry.c1, base_qry.c2 WHERE base_qry.day = '2010-02-01'
INSERT OVERWRITE TABLE tbl2 PARTITION (day='2010-02-02')
  SELECT base_qry.c1, base_qry.c2 WHERE base_qry.day = '2010-02-02'
...
INSERT OVERWRITE TABLE tbl2 PARTITION (day='2010-02-20')
  SELECT base_qry.c1, base_qry.c2 WHERE base_qry.day = '2010-02-20'
Run a Hive query on day N (execution day 2010-02-21) over table "tbl1"
partitions for days N-20 to N-1, and refresh/overwrite the "day" partitions
of "tbl2" for days N-20 to N-1.
- Each "day" partition of "tbl1" has 24 files.
- The last Hadoop job of base_qry is a map-only job (and we want it to stay
a map-only job for efficiency reasons).
The query execution resulted in 480 files (24 files/day * 20 days = 480 map
tasks, each writing one file into every output partition) for EACH "day"
partition of "tbl2": 24 non-empty files and 456 empty ones.
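One possible mitigation, assuming your Hive build already has them (please verify, since these settings may not exist in every release), is to let Hive concatenate small output files at the end of the job via the merge settings. Note the trade-off: the merge runs as an extra stage after the query, so it gives up some of the map-only efficiency in exchange for far fewer output files:

```sql
-- Hedged sketch: check that these properties exist in your Hive build.
SET hive.merge.mapfiles=true;           -- merge small files from map-only jobs
SET hive.merge.mapredfiles=true;        -- merge small files from map-reduce jobs
SET hive.merge.size.per.task=256000000; -- target size of each merged file, in bytes
```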
Now a user running a query like "SELECT blah FROM tbl2 WHERE day >=
'2010-02-01' AND day <= '2010-02-20'" suddenly starts 480*20 = 9600 mappers
(480 expected), and it gets worse with a bigger date range.
As a quick fix I am running the query in a FOR loop, once per day's partition
(or I could delete the empty files as a post-processing step). Any other
suggestions?
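For reference, the per-day version of the quick fix is just the single-INSERT form of the query, executed once per date from a driver script (table and column names taken from the simplified query above):

```sql
-- One invocation per day; the driver script substitutes the date.
-- With a single INSERT clause, each map task writes into only one
-- output partition, so no empty files are created.
INSERT OVERWRITE TABLE tbl2 PARTITION (day='2010-02-01')
SELECT c1, c2 FROM tbl1 WHERE day = '2010-02-01';
```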
- I believe Hive 0.5 handles small/empty files better, so this might not be
a big issue there?
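If I understand the read-side improvement correctly, it comes from combining input splits, which packs many small (and empty) files into each map split instead of spawning one mapper per file. A hedged sketch, assuming CombineHiveInputFormat is available in your Hive build (it needs Hadoop 0.20+):

```sql
-- Assumption: CombineHiveInputFormat exists in your Hive release.
SET hive.input.format=org.apache.hadoop.hive.ql.io.CombineHiveInputFormat;
SELECT blah FROM tbl2 WHERE day >= '2010-02-01' AND day <= '2010-02-20';
```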
- If we can fix this behavior, I will be happy to spend some time on this.