Today I tried CombineHiveInputFormat and set the max split size for the
Hadoop input. It seems I can get the expected number of map tasks. But
another problem is that CPU usage by the map tasks is very high, almost 100%.
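(Roughly, the settings in question look like this in the Hive CLI; the 64 MB
split size here is just an illustrative value:)

set hive.input.format=org.apache.hadoop.hive.ql.io.CombineHiveInputFormat;
-- upper bound on the bytes combined into one split; this is what
-- determines how many map tasks get launched
set mapred.max.split.size=67108864;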

I just ran a query with a simple WHERE condition over test files whose
total size is about 30 MB, spread across roughly 10 thousand small files.
The execution time was over 700s, which is killing us. Because the files
are generated by Flume, they are all sequence files.
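(The query was of this general shape; the table and column names here are
made up for illustration:)

SELECT * FROM flume_logs WHERE status = 'ERROR';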


R
On Tue, May 31, 2011 at 2:55 AM, Junxian Yan wrote:

Hi Guys

I use Flume to store log files, and use Hive to query them.

Flume always stores the small files with the suffix .seq. Now I have over 35
thousand seq files. Every time I launch the query script, 35 thousand map
tasks are created, and it takes a very long time for them to complete.

I also tried setting CombineHiveInputFormat, but with this option the task
seems to execute slowly, since the total size of the data folder is over
700 MB and my testing environment only has 3 data nodes. I also tried adding
mapred.map.tasks=5 after the CombineHiveInputFormat setting, but it doesn't
seem to work: there is always only one map task when CombineHiveInputFormat
is set.
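(In Hive CLI terms, what was tried was roughly the following:)

set hive.input.format=org.apache.hadoop.hive.ql.io.CombineHiveInputFormat;
set mapred.map.tasks=5;
-- note: with CombineHiveInputFormat the number of splits is governed by
-- the split-size settings (e.g. mapred.max.split.size), so setting
-- mapred.map.tasks alone may have no effect on the map count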

Can you please show me a solution in which I can set the number of map tasks freely?

BTW: the Hadoop version is 0.20 and Hive is 0.5.

Richard
