Hi,
I have a dynamic partition query which generates quite a few small
files which I would like to merge:

SET hive.exec.dynamic.partition.mode=nonstrict;
SET hive.exec.dynamic.partition=true;
SET hive.exec.compress.output=true;
SET io.seqfile.compression.type=BLOCK;
SET hive.merge.size.per.task=256000000;
SET hive.merge.smallfiles.avgsize=16000000000;
SET hive.merge.mapfiles=true;
SET hive.merge.mapredfiles=true;
SET hive.mergejob.maponly=true;
INSERT OVERWRITE TABLE daily_conversions_without_rank_all_table
PARTITION(org_id, day)
SELECT session_id, permanent_id, first_date, last_date, week, month, quarter,
referral_type, search_engine, us_search_engine,
keyword, unnormalized_keyword, branded, conversion_meet, goals_meet,
pages_viewed,
entry_page, page_types,
org_id, day
FROM daily_conversions_without_rank_table;

I am running the latest version from trunk with HIVE-1622, but it
seems I just can't get the post-merge process to happen. I have
raised hive.merge.smallfiles.avgsize. I'm wondering if the filtering
at runtime is causing the merge step to be skipped. Attached are
the Hive output and log files.


Thanks,
Sammy
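
One way to confirm whether the merge stage made it into the plan is to inspect the EXPLAIN output before running the insert; a merge-enabled plan should contain a conditional merge stage in addition to the main map-reduce stage and the move stage. A minimal check (a sketch, using the same table and column names as the query above):

```sql
-- Sketch: inspect the plan before running the insert.
-- A merge-enabled plan contains a conditional merge stage in addition
-- to the main map-reduce stage and the move stage.
EXPLAIN EXTENDED
INSERT OVERWRITE TABLE daily_conversions_without_rank_all_table
PARTITION(org_id, day)
SELECT session_id, permanent_id, first_date, last_date, week, month, quarter,
       referral_type, search_engine, us_search_engine,
       keyword, unnormalized_keyword, branded, conversion_meet, goals_meet,
       pages_viewed, entry_page, page_types,
       org_id, day
FROM daily_conversions_without_rank_table;
```

If the output only shows a map-reduce stage and a move stage, the planner dropped the merge before the job ever ran.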


  • Edward Capriolo at Oct 16, 2010 at 12:11 am
    Sammy,

This is not the exact remedy you were looking for, but my company
open-sourced our file crusher utility.

    http://www.jointhegrid.com/hadoop_filecrush/index.jsp

We use it to good effect to turn many small files into one. It works
with text files, sequence files, and custom writables.

    Edward
  • Ning Zhang at Oct 16, 2010 at 12:55 am
The output file shows only 2 jobs (the map-reduce job and the move task). This indicates that the plan does not have merge enabled. Merge should consist of a ConditionalTask and 2 subtasks (an MR task and a move task). Can you send the plan of the query?

One thing I noticed is that you are using Amazon EMR. I'm not sure if this is supported there, since SET hive.mergejob.maponly=true requires CombineHiveInputFormat (only available in Hadoop 0.20, and someone reported that some distributions of Hadoop don't support it). So an additional thing you can try is removing this setting.
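
The CombineHiveInputFormat dependency can also be tested directly from the CLI; a sketch, assuming a Hadoop 0.20-based cluster (the class name below is the stock Hive one):

```sql
-- Sketch: force the combine input format explicitly before the insert
-- (requires Hadoop 0.20; not all Hadoop distributions support it).
SET hive.input.format=org.apache.hadoop.hive.ql.io.CombineHiveInputFormat;
-- ...then re-run the INSERT OVERWRITE and check whether the merge stage appears.
```

If the job fails or the setting is silently ignored with this input format, that points at the cluster rather than the merge configuration.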
  • Sammy Yu at Oct 16, 2010 at 5:50 am
    Hi guys,
Thanks for the response. I tried running without
hive.mergejob.maponly, with the same result. I've attached the explain
extended output. I am running this query on EC2 boxes; however, it's
not running on EMR. Hive is running on top of a Hadoop 0.20.2 setup.

    Thanks,
    Sammy


    --
    Chief Architect, BrightEdge
    email: syu@brightedge.com   |   mobile: 650.539.4867  |   fax:
    650.521.9678  |  address: 1850 Gateway Dr Suite 400, San Mateo, CA
    94404
  • Dave Brondsema at Nov 10, 2010 at 6:05 pm
    Hi, has there been any resolution to this? I'm having the same trouble.
    With Hive 0.6 and Hadoop 0.18 and a dynamic partition
    insert, hive.merge.mapredfiles doesn't work. It works fine for a static
    partition insert. What I'm seeing is that even when I
    set hive.merge.mapredfiles=true, the jobconf has it as false for the dynamic
    partition insert.

    I was reading https://issues.apache.org/jira/browse/HIVE-1307 and it looks
    like maybe Hadoop 0.20 is required for this?

    Thanks,
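
One quick way to see the value Hive will actually use (as opposed to what was set in the script) is to echo the properties back from the CLI; a small sketch:

```sql
-- Sketch: `SET <property>;` with no value prints the current effective setting.
SET hive.merge.mapredfiles;
SET hive.merge.mapfiles;
SET hive.merge.smallfiles.avgsize;
```

If the CLI reports true but the jobconf shows false, the planner is overriding the setting at compile time.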


    --
    Dave Brondsema
    Software Engineer
    Geeknet

    www.geek.net
  • Yongqiang he at Nov 10, 2010 at 9:31 pm
I think this problem has been fixed in Hive trunk; you can just try trunk.
  • Dave Brondsema at Nov 12, 2010 at 5:44 pm
    It seems that I can't use this with Hadoop 0.18 since the
    Hadoop18Shims.getCombineFileInputFormat returns null, and
    SemanticAnalyzer.java sets HIVEMERGEMAPREDFILES to false if
    CombineFileInputFormat is not supported. Is that right? Maybe I can copy
    the Hadoop19Shims implementation of getCombineFileInputFormat into
    Hadoop18Shims?
  • Dave Brondsema at Nov 12, 2010 at 9:40 pm
    I copied Hadoop19Shims' implementation of getCombineFileInputFormat
    (HIVE-1121) into Hadoop18Shims and it worked, if anyone is interested.

    And hopefully we can upgrade our Hadoop version soon :)

Discussion Overview
group: user @ hive.apache.org
categories: hive, hadoop
posted: Oct 15, '10 at 8:44p
active: Nov 12, '10 at 9:40p
posts: 8
users: 5
website: hive.apache.org
