Grokbase Groups Hive user April 2011
I could not find instructions on how to avoid the performance problem of too
many mappers being created, one for every small file. Thanks!


  • V.Senthil Kumar at Apr 8, 2011 at 6:43 pm
    You can add these properties to hive-site.xml. The merge step combines the
    small output files into larger ones at the end of the job. Hope it helps.

    <property>
      <name>hive.merge.mapredfiles</name>
      <value>true</value>
      <description>Merge small files at the end of a map-reduce job</description>
    </property>

    <property>
      <name>hive.input.format</name>
      <value>org.apache.hadoop.hive.ql.io.CombineHiveInputFormat</value>
      <description>The default input format; if not specified, the system
      assigns it. It is set to HiveInputFormat for Hadoop versions 17, 18 and
      19, and to CombineHiveInputFormat for Hadoop 20. The user can always
      override it: if there is a bug in CombineHiveInputFormat, it can always
      be manually set to HiveInputFormat.</description>
    </property>
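
    If you also want to control how large the merged output files are, these
    related merge properties may help. This is a sketch using the property
    names and defaults from Hive of this era; verify them against your
    Hive version before relying on them:

    <property>
      <name>hive.merge.size.per.task</name>
      <value>256000000</value>
      <description>Target size of the merged files produced at the end of
      the job</description>
    </property>

    <property>
      <name>hive.merge.smallfiles.avgsize</name>
      <value>16000000</value>
      <description>If the average size of the job's output files is below
      this threshold, launch an extra merge job (only when
      hive.merge.mapfiles / hive.merge.mapredfiles is true)</description>
    </property>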






    ________________________________
    From: Michael Jiang <it.mjjiang@gmail.com>
    To: user@hive.apache.org
    Sent: Fri, April 8, 2011 11:34:58 AM
    Subject: How to configure Hive to use CombineFileInputFormat in case of too many
    small files

    Could not find the instructions regarding this to avoid performance issues when
    too many mappers have to be created for every small file. Thanks!
  • Michael Jiang at Apr 8, 2011 at 6:56 pm
    Thanks Kumar.

    Are there other settings to fine-tune how small files are merged into the
    larger input that a mapper takes? Basically, I want to match the size of a
    merged file to the HDFS block size.


    On Fri, Apr 8, 2011 at 11:43 AM, V.Senthil Kumar wrote:

    You can add these lines in hive-site.xml. It creates only one file at the
    end. Hope it helps.

    <property>
    <name>hive.merge.mapredfiles</name>
    <value>true</value>
    <description>Merge small files at the end of a map-reduce
    job</description>
    </property>

    <property>
    <name>hive.input.format</name>
    <value>org.apache.hadoop.hive.ql.io.CombineHiveInputFormat</value>
    <description>The default input format, if it is not specified, the system
    assigns it. It is set to HiveInputFormat for hadoop versions 17, 18 and 19,
    whereas it is set to CombineHiveInputFormat for hadoop 20. The user can
    always overwrite it - if there is a bug in CombineHiveInputFormat, it can
    always be manually set to HiveInputFormat. </description>
    </property>



    ------------------------------
    *From:* Michael Jiang <it.mjjiang@gmail.com>
    *To:* user@hive.apache.org
    *Sent:* Fri, April 8, 2011 11:34:58 AM
    *Subject:* How to configure Hive to use CombineFileInputFormat in case of
    too many small files

    Could not find the instructions regarding this to avoid performance issues
    when too many mappers have to be created for every small file. Thanks!
  • Michael Jiang at Apr 8, 2011 at 9:38 pm
    I didn't find a configuration property that specifically controls the
    merged input size for a mapper, only ones for map or reduce output size.
    For example, "hive.merge.smallfiles.avgsize" looks like what I want, but it
    actually applies to output. In Pig, "pig.maxCombinedSplitSize" does a
    similar job. Is there a similar setting in Hive?

    Thanks!
    On Fri, Apr 8, 2011 at 12:18 PM, V.Senthil Kumar wrote:



    You can find other related configuration parameters here
    http://wiki.apache.org/hadoop/Hive/AdminManual/Configuration
    I think you can set the file sizes there. I haven't tried it, but I
    believe it's covered on that page.
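
    For the input side, CombineHiveInputFormat builds its splits through
    Hadoop's CombineFileInputFormat, so the combined split size is governed by
    the underlying Hadoop parameters rather than a Hive-specific one. A sketch
    of matching the split size to a 128 MB block, using the Hadoop 0.20-era
    property names (verify these against your Hadoop version):

    <property>
      <name>mapred.max.split.size</name>
      <value>134217728</value>
      <description>Upper bound on a combined split, in bytes; setting it to
      the HDFS block size (128 MB here) makes each mapper read roughly one
      block's worth of data</description>
    </property>

    <property>
      <name>mapred.min.split.size.per.node</name>
      <value>134217728</value>
      <description>Minimum bytes to combine on a single node before the
      remainder spills to rack-level combining</description>
    </property>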


Discussion Overview
group: user@hive.apache.org
categories: hive, hadoop
posted: Apr 8, '11 at 6:35p
active: Apr 8, '11 at 9:38p
posts: 4
users: 2
website: hive.apache.org

2 users in discussion: Michael Jiang (3 posts), V.Senthil Kumar (1 post)
