Grokbase Groups Hive user March 2011
I understand that Hive and Hadoop are meant to run many jobs at once. As a
result, most tuning parameters are meant to increase the throughput of a
Hadoop cluster rather than to reduce latency. In our case, we use Elastic
MapReduce to run a single Hive script on a daily basis, so our top priority
is to make that script run faster. So far, it's been a pretty frustrating
experience. I am curious whether there are workarounds for the things that
are not easy to tune:

1) In particular, Hadoop lets you configure
mapred.tasktracker.map.tasks.maximum and
mapred.tasktracker.reduce.tasks.maximum individually, but there is no way to
limit the total of the two. Hive mappers always seem to finish before the
reducers, and I wish I could run one more reducer once no mappers are
running. That doesn't seem to be possible.

2) Similarly, there is only one parameter to control memory allocation:
mapred.child.java.opts. So if my box is configured for 4 mappers and 2
reducers, I have to set that parameter to less than 1/6 of the total memory
available. The problem is that once the mappers are done, 4/6 (two thirds)
of all memory is essentially unused. Is there something I can do about
that?

3) Another odd thing is not being able to run a single wave of reducers
easily. As I understand it, that's the optimal scenario in most cases. To
make this work, I have to know the total number of reducer slots in the
cluster and then set mapred.reduce.tasks accordingly. EMR seems to have a
solution for this problem (mapred.reduce.tasksperslot), but it doesn't seem
to work.
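
For reference, the arithmetic behind points 2 and 3 can be sketched in a few lines of Python. This is only an illustration: the function names, the 6144 MB figure, and the 10-node cluster are made-up assumptions, not values from any real EMR instance type, and the heap figure is an upper bound because a JVM uses more memory than its -Xmx setting.

```python
from fractions import Fraction


def heap_per_slot_mb(total_mb, map_slots, reduce_slots, reserved_mb=0):
    """Upper bound (in MB) for the -Xmx value in mapred.child.java.opts.

    Every task JVM gets the same heap, so the memory left after
    `reserved_mb` is split evenly across all map and reduce slots.
    """
    slots = map_slots + reduce_slots
    if slots <= 0:
        raise ValueError("need at least one task slot")
    return (total_mb - reserved_mb) // slots


def idle_fraction_after_maps(map_slots, reduce_slots):
    """Share of task memory left unused once only reducers are running."""
    return Fraction(map_slots, map_slots + reduce_slots)


def single_wave_reducers(nodes, reduce_slots_per_node):
    """Value for mapred.reduce.tasks that fills every reduce slot exactly once."""
    return nodes * reduce_slots_per_node


if __name__ == "__main__":
    # Hypothetical box: 6144 MB for tasks, 4 map slots, 2 reduce slots.
    print("mapred.child.java.opts=-Xmx%dm" % heap_per_slot_mb(6144, 4, 2))
    print("idle after map phase:", idle_fraction_after_maps(4, 2))
    # Hypothetical cluster: 10 task nodes with 2 reduce slots each.
    print("mapred.reduce.tasks=%d" % single_wave_reducers(10, 2))
```

The single-wave value has to be recomputed whenever the cluster size or the per-node reduce slot count changes, which is the inconvenience described in point 3.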

Any suggestions would be greatly appreciated!

Thank you,
igor


  • Andrew Hitchcock at Mar 14, 2011 at 9:26 pm
    Hi,

    Quick note on #3. In order to make mapred.reduce.tasksperslot work,
    you need to completely remove all mentions of mapred.reduce.tasks from
    your configuration (including removing it from the default config
    file). mapred.reduce.tasksperslot only takes effect as a last resort,
    that is, when mapred.reduce.tasks is not set anywhere.
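
One way to check for leftover mentions is to scan each configuration file for the property. The sketch below is a minimal illustration in Python, assuming the standard Hadoop XML layout (`<configuration>` containing `<property>` elements with `<name>` and `<value>` children). The helper name and the sample XML are hypothetical, and the sketch does not cover values set elsewhere, such as a SET statement inside the Hive script.

```python
import xml.etree.ElementTree as ET


def property_values(config_xml, name):
    """Return every value assigned to `name` in a Hadoop-style config XML string."""
    root = ET.fromstring(config_xml)
    values = []
    for prop in root.iter("property"):
        # Exact match, so mapred.reduce.tasksperslot is not mistaken
        # for mapred.reduce.tasks.
        if (prop.findtext("name") or "").strip() == name:
            values.append((prop.findtext("value") or "").strip())
    return values


# Made-up sample standing in for a real config file's contents.
SAMPLE = """<configuration>
  <property><name>mapred.reduce.tasks</name><value>1</value></property>
  <property><name>mapred.reduce.tasksperslot</name><value>1</value></property>
</configuration>"""

if __name__ == "__main__":
    leftovers = property_values(SAMPLE, "mapred.reduce.tasks")
    if leftovers:
        print("mapred.reduce.tasks is still set:", leftovers)
```

Running the same check against the contents of every config file on the cluster, including the default one, shows whether any mention remains.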

    Andrew
    On Wed, Mar 9, 2011 at 11:50 AM, Igor Tatarinov wrote:
