Grokbase Groups Hive user March 2011
FAQ
I understand that Hive and Hadoop are meant to run many jobs at once. As a
result, most tuning parameters are meant to increase the throughput of a
Hadoop cluster rather than latency. In our case, we use Elastic Map Reduce
to run a single Hive script on a daily basis. For that reason, our top
priority is to make the script run faster. So far, it's been a pretty
frustrating experience. I am curious if there are workarounds for the things
that are not easy to tune:

1) In particular, Hadoop lets you
configure mapred.tasktracker.map/reduce.tasks.maximum individually but there
is no way to limit the total of the two. Hive mappers seem to always finish
before the reducers and I wish I could run 1 more reducer when no mappers
are running at the same time. That doesn't seem to be possible.

2) Similarly, there is only one parameter to control memory
allocation: mapred.child.java.opts. So if my box is configured for 4 mappers
and 2 reducers, I have to set that parameter to less than 1/6 of total
memory available. The only problem is that once the mappers are done, 4/6th
or two thirds of all memory is essentially not being used. Is there
something I can do about that?

3) Another odd thing is not being able to run a single wave of reducers
easily. As I understand that's the optimal scenario in most cases. To make
this work, I have to know the total number of reducer slots in the cluster
and then define mapred.reduce.tasks accordingly. EMR seems to have a
solution for this problem (mapred.reduce.tasksperslot) but it doesn't seem
to work.

Any suggestions would be greatly appreciated!

Thank you,
igor

Search Discussions

Discussion Posts

Follow ups

Related Discussions

Discussion Navigation
viewthread | post
posts ‹ prev | 1 of 2 | next ›
Discussion Overview
groupuser @
categorieshive, hadoop
postedMar 9, '11 at 7:51p
activeMar 14, '11 at 9:26p
posts2
users2
websitehive.apache.org

People

Translate

site design / logo © 2022 Grokbase