I use Flume to store log files, and Hive to query them.
Flume always stores small files with the suffix .seq, and by now I have over
35 thousand seq files. Every time I launch the query script, 35 thousand map
tasks are created, and it takes a very long time for them to complete.
I also tried setting CombineHiveInputFormat, but with that option the task
seems to run slowly, because the total size of the data folder is over 700 MB
and my testing environment has only 3 data nodes. I also tried adding
mapred.map.tasks=5 after the CombineHiveInputFormat setting, but that doesn't
seem to work: there is always only one map task when CombineHiveInputFormat
is set.
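Roughly, the settings I tried look like this (a sketch of my session, with
the property names as I understand them for my Hadoop/Hive versions):

```sql
-- combine the many small .seq files into fewer input splits
set hive.input.format=org.apache.hadoop.hive.ql.io.CombineHiveInputFormat;
-- attempt to force 5 map tasks (this appears to be ignored)
set mapred.map.tasks=5;

select count(*) from my_log_table;  -- my_log_table is a placeholder name
```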
Can you please show me a solution that lets me set the number of map tasks
freely?
BTW: the Hadoop version is 20 and Hive is 0.5.