Grokbase Groups Hive user March 2011
Hi Experts
I'm currently working with Hive 0.7, mostly with joins. In all permissible
cases I'm using map joins by setting the hive.auto.convert.join=true parameter.
Using local map joins has given a considerable performance improvement in my
Hive queries. So far I have used local map joins only with the default Hive
configuration parameters; now I'd like to dig deeper into this and try out
local map joins on somewhat bigger tables with more rows. Given below is the
failure log of one of my local map tasks, after which its backup common join
task is executed

2011-03-31 09:56:54 Starting to launch local task to process map join;
maximum memory = 932118528
2011-03-31 09:56:57 Processing rows: 200000 Hashtable size: 199999
Memory usage: 115481024 rate: 0.124
2011-03-31 09:57:00 Processing rows: 300000 Hashtable size: 299999
Memory usage: 169344064 rate: 0.182
2011-03-31 09:57:03 Processing rows: 400000 Hashtable size: 399999
Memory usage: 232132792 rate: 0.249
2011-03-31 09:57:06 Processing rows: 500000 Hashtable size: 499999
Memory usage: 282338544 rate: 0.303
2011-03-31 09:57:10 Processing rows: 600000 Hashtable size: 599999
Memory usage: 336738640 rate: 0.361
2011-03-31 09:57:14 Processing rows: 700000 Hashtable size: 699999
Memory usage: 391117888 rate: 0.42
2011-03-31 09:57:22 Processing rows: 800000 Hashtable size: 799999
Memory usage: 453906496 rate: 0.487
2011-03-31 09:57:27 Processing rows: 900000 Hashtable size: 899999
Memory usage: 508306552 rate: 0.545
2011-03-31 09:57:34 Processing rows: 1000000 Hashtable size: 999999
Memory usage: 562706496 rate: 0.604
FAILED: Execution Error, return code 2 from
org.apache.hadoop.hive.ql.exec.MapredLocalTask
ATTEMPT: Execute BackupTask: org.apache.hadoop.hive.ql.exec.MapRedTask
Launching Job 4 out of 6


Here I'd like to get this local map task running. To that end I tried setting
the following Hive parameters:
hive -f HiveJob.txt -hiveconf hive.mapjoin.maxsize=1000000 -hiveconf
hive.mapjoin.smalltable.filesize=40000000 -hiveconf hive.auto.convert.join=true
But setting these two config parameters doesn't make my local map task proceed
beyond this stage. I didn't try overriding
hive.mapjoin.localtask.max.memory.usage=0.90, because my task log shows that
the memory usage rate is just 0.604, so I assume setting it to a larger value
won't be a solution in my case. Could someone please guide me on which
parameters, and which values, I should set to get things rolling?
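(As a quick sanity check on the log above: the "rate" column is simply the reported memory usage divided by the maximum memory, which can be reproduced from the figures in the log:)

```shell
# Reproduce the "rate" column from the local task log:
# rate = memory usage / maximum memory.
max=932118528   # "maximum memory" reported at task start
for usage in 115481024 562706496; do
  awk -v u="$usage" -v m="$max" 'BEGIN { printf "usage=%d rate=%.3f\n", u, u/m }'
done
```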


Thank You

Regards
Bejoy.K.S


  • Yongqiang he at Mar 31, 2011 at 11:25 pm
    You possibly got an OOM error when processing the small table. OOM is
    a fatal error that cannot be controlled by the Hive configs. So can
    you try increasing your memory setting?

    thanks
    yongqiang
  • Bejoy_ks at Apr 1, 2011 at 5:14 am
    Thanks Yongqiang for your reply. I'm running a Hive script which has nearly 10 joins in it. Of those, all the map joins involving smaller tables (9 of them, each involving one small table) are running fine. Just 1 join is on two larger tables, and that map join fails; however, since the backup task (common join) executes successfully, the whole Hive job still runs to completion.
    In brief, my Hive job is running successfully now, but I'd like to get the failed map join running as well, instead of falling back to the common join. I'm curious to see what the performance improvement would be with this difference in execution.
    To get a map join executed on larger tables, do I have to tweak memory parameters in Hadoop?
    Since my entire job already runs to completion and I just want to get one map join working, shouldn't altering some Hive map join parameters do the job?
    Please advise


    Regards
    Bejoy K S

  • Yongqiang he at Apr 1, 2011 at 5:27 am
    Can you try this one, "hive.mapred.local.mem" (in MB)? It controls
    the heap size of the join's local child process.
    You can also try increasing HADOOP_HEAPSIZE for your Hive client.

    But all of this depends on how big your small file is.

    thanks
    yongqiang
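(A sketch of how these two suggestions might be applied together; the heap sizes below are hypothetical placeholders and would need tuning to the actual in-memory size of the small table:)

```shell
# Hypothetical values: raise the Hive client heap and the local map-join
# task heap (in MB), then re-run the same script.
export HADOOP_HEAPSIZE=2048    # heap for the Hive client JVM
hive -f HiveJob.txt \
     -hiveconf hive.mapred.local.mem=1024 \
     -hiveconf hive.auto.convert.join=true
```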
  • Bejoy_ks at Apr 1, 2011 at 2:09 pm
    Thanks Yongqiang. It worked for me and I was able to evaluate the performance. It proved to be expensive :)
    Regards
    Bejoy K S

  • Viral Bajaria at Apr 1, 2011 at 8:26 am
    Bejoy,

    We still use an older version of Hive (0.5). In that version the join order
    mattered: you needed to keep the largest table as the rightmost in
    your JOIN sequence to make sure that it is streamed, and thus avoid the OOM
    exceptions caused by mappers that load the entire table into memory
    and run out of the JVM -Xmx allocation.

    If you cannot do that, then you can use the STREAMTABLE hint as follows:
    SELECT /*+ STREAMTABLE(t1) */ * FROM t1 join t2 on t1.col1 = t2.col1 <.....>

    Thanks,
    Viral
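(For comparison, the complementary hint from the same era is MAPJOIN, which asks Hive to load the named table into an in-memory hashtable rather than streaming it; a sketch with hypothetical tables, assuming t2 is the small one:)

```sql
-- Hypothetical tables t1 (large) and t2 (small): build a hashtable
-- from t2 and stream t1 through the mappers.
SELECT /*+ MAPJOIN(t2) */ *
FROM t1 JOIN t2 ON t1.col1 = t2.col1;
```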
  • Bejoy_ks at Apr 1, 2011 at 2:18 pm
    Thanks for your reply, Viral. However, in later versions of Hive you don't have to tell Hive anything (such as which is the smaller table). At runtime Hive itself identifies the smaller table and runs the local map task on it, irrespective of whether it comes on the left or right side of the join. There is a Facebook post on such join optimizations within Hive; you can get a better picture from that.
    Regards
    Bejoy K S


Discussion Overview
group: user@hive.apache.org
categories: hive, hadoop
posted: Mar 31, '11 at 2:26p
active: Apr 1, '11 at 2:18p
posts: 7
users: 3
website: hive.apache.org
