FAQ
I have a cluster of about 10 nodes, and have mainly been using Pig on it, but
would like to try out Hive.

I am running this query, which doesn't take too long in Pig, but is taking
quite a long time in Hive.

hive -e "select count(1) as ct from my_table where v1='02' and v2 =
11112222;" > thecount

One thing I noticed is that this job uses only 1 reducer, and it spends most of
its time in the reduce step. I tried manually setting more reducers, but it
seems that for a job without groups, Hive forces 1 reducer?

Either way, I would love to know why this is dragging. It's worth noting that
my_table is not stored in a Hive-native format, but rather as a flat file. I
realize that this can influence performance, but shouldn't it at least
perform on par with Pig?

Thanks for your help
Jon

  • Ajo Fod at Jan 24, 2011 at 9:54 pm
    You could try setting the number of reducers, e.g.:
    set mapred.reduce.tasks=4;

    Set this before running the select.
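
    For example (just a sketch, reusing the table and filter values from the
    original mail), the setting and the query can go into the same hive -e call:

        hive -e "
        set mapred.reduce.tasks=4;
        select count(1) as ct from my_table where v1='02' and v2=11112222;
        " > thecount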

    -Ajo
  • Ajo Fod at Jan 24, 2011 at 9:55 pm
    Oh ... sorry, you say you already tried that.


  • Jonathan Coveney at Jan 24, 2011 at 10:01 pm
    Yes, I tried that; it looks like Hive forces it to 1 if there are no groups.

  • Ajo Fod at Jan 24, 2011 at 10:27 pm
    So, in Pig this is reduced by 4 threads automatically? That would be interesting.

    I've usually used groups, but when there is only one group, you still
    have this problem where the CPU resources are underutilized.

    -Ajo.
  • Namit Jain at Jan 24, 2011 at 11:12 pm
    Although there is only 1 reducer, the amount of data going to that reducer
    should be really small: it will have the same number of rows as the number
    of mappers (one partial count per mapper).

    Can you check how much data your reducer is getting?
    Is it taking a long time to read that small amount of data from each mapper?
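
    One way to check (assuming the stock Hadoop counters of that era; the exact
    names vary by version) is to open the job in the JobTracker web UI and look
    at the "Reduce input records" and "Reduce shuffle bytes" counters. Running
    EXPLAIN first also shows whether the plan does a map-side partial aggregate
    feeding a single final reducer:

        hive -e "
        explain
        select count(1) as ct from my_table where v1='02' and v2=11112222;
        "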


    Thanks,
    -namit

  • Ajo Fod at Jan 25, 2011 at 3:46 pm
    One more thing to try:
    http://www.karmasphere.com/Karmasphere-Analyst/hive-queries-on-table-data.html#multi_group_by_inserts


    Look for this text:
    "hive.map.aggr controls how we do aggregations"

    Let me know if this hint helps.
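
    For example (a sketch only; whether map-side aggregation is already on by
    default depends on the Hive version), enabling it before the count looks like:

        set hive.map.aggr=true;
        select count(1) as ct from my_table where v1='02' and v2=11112222;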

    Cheers,
    Ajo.

Discussion Overview
group: user@hive.apache.org
categories: hive, hadoop
posted: Jan 24, '11 at 9:14p
active: Jan 25, '11 at 3:46p
posts: 7
users: 3
website: hive.apache.org
