Ning is currently out on vacation; I think he'll be back to working on this when he returns.

JVS

________________________________________
From: Viraj Bhat [viraj@yahoo-inc.com]
Sent: Thursday, July 01, 2010 11:40 PM
To: hive-user@hadoop.apache.org
Subject: RE: merging the size of the reduce output

Okay, I read that handling small files when doing dynamic partitioning is a work in progress:
https://issues.apache.org/jira/browse/HIVE-1307
There was a suggestion to try:
hive.input.format=org.apache.hadoop.hive.ql.io.CombineHiveInputFormat for Hadoop 0.20 when running queries on this partition.
Viraj

________________________________
From: Viraj Bhat
Sent: Thursday, July 01, 2010 11:31 PM
To: hive-user@hadoop.apache.org
Cc: athusoo@facebook.com
Subject: RE: merging the size of the reduce output

Hi Yongqiang,
I am facing a similar situation with the latest trunk of Hive. I am using dynamic partitioning in a map-only job that converts gzip-compressed text files to RCFile format.
The query looks similar to:

FROM gztable
INSERT OVERWRITE TABLE rctable
PARTITION(datestamp, partitionlevel1, partitionlevel1)
SELECT …


..
set hive.merge.mapredfiles=true;
set hive.merge.mapfiles=true;
set hive.merge.smallfiles.avgsize=256000000;
set hive.merge.size.smallfiles.avgsize=256000000;

When I run a job, I see that the following are set to false in the job.xml when the job starts up.
hive.merge.mapfiles = false;
hive.merge.mapredfiles = false;

Is this a bug with dynamic partitioning? Is there something else I need to set to get this to work and remove the small files I might be generating?

Viraj

________________________________
From: Yongqiang He
Sent: Sunday, June 13, 2010 10:56 PM
To: hive-user@hadoop.apache.org
Subject: Re: merging the size of the reduce output

I think there is another parameter, “hive.merge.smallfiles.avgsize”, which decides whether to run the merge job based on the average size of the output files. The default for that parameter is 16M, so if the average output file size is larger than 16M, Hive will not merge.
Maybe you can try increasing that value to see whether it helps.

Thanks
Yongqiang
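
As a rough illustration of the rule described above, here is a minimal Python sketch. It is not Hive's actual implementation: the function name `should_merge` and the constant name are hypothetical, and the 16,000,000-byte default is only an approximation of the "16M" mentioned in the message.

```python
# Simplified sketch of the average-size check described above.
# Assumption: the merge job runs only when the average output file
# size is below the hive.merge.smallfiles.avgsize threshold.

DEFAULT_AVG_SIZE = 16_000_000  # approximates the "16M" default (assumed value in bytes)


def should_merge(file_sizes, avg_size_threshold=DEFAULT_AVG_SIZE):
    """Return True if the average output file size is below the threshold."""
    if not file_sizes:
        # Nothing was written, so there is nothing to merge.
        return False
    average = sum(file_sizes) / len(file_sizes)
    return average < avg_size_threshold


# Many small files: the average is far below the threshold, so merge.
print(should_merge([1_000_000, 2_000_000, 3_000_000]))

# Files averaging 20 MB are skipped at the default threshold, but
# qualify once the threshold is raised to 256,000,000.
print(should_merge([20_000_000] * 4))
print(should_merge([20_000_000] * 4, avg_size_threshold=256_000_000))
```

Under this reading, raising the threshold only widens the set of outputs that qualify for merging; it has no effect if the merge flags themselves end up disabled in the submitted job.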
On 6/13/10 10:41 PM, "Sammy Yu" wrote:
Hi,
I have both hive.merge.mapfiles and hive.merge.mapredfiles set to true via the shell tool and the hive-default.xml configuration file. However, it appears that the job configuration is somehow changed before the job is submitted. Is there another condition that can cause this to happen?

Thanks,
Sammy


On Sun, Jun 13, 2010 at 7:39 AM, Ted Yu wrote:
Looking at ql/src/java/org/apache/hadoop/hive/ql/optimizer/GenMRFileSink1.java, hive.merge.mapredfiles takes effect if there is a reducer for your job.
Otherwise you should set hive.merge.mapfiles to true.
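
The distinction above can be sketched in a few lines of Python. This is only a model of the behaviour as described in this message, not code from GenMRFileSink1.java; the helper names `relevant_merge_flag` and `merge_requested` are hypothetical.

```python
# Sketch of which merge flag applies to a job, per the message above.
# Assumption: jobs with a reducer consult hive.merge.mapredfiles,
# and map-only jobs consult hive.merge.mapfiles.


def relevant_merge_flag(has_reducer):
    """Return the name of the merge property that governs this job."""
    if has_reducer:
        return "hive.merge.mapredfiles"
    return "hive.merge.mapfiles"


def merge_requested(conf, has_reducer):
    """Return True if the governing merge property is set to 'true' in conf."""
    flag = relevant_merge_flag(has_reducer)
    return conf.get(flag, "false").lower() == "true"


# A map-only job with only hive.merge.mapredfiles enabled would not merge.
conf = {"hive.merge.mapredfiles": "true"}
print(relevant_merge_flag(has_reducer=False))
print(merge_requested(conf, has_reducer=False))
print(merge_requested(conf, has_reducer=True))
```

For a map-only dynamic partition insert, this suggests checking hive.merge.mapfiles first, since hive.merge.mapredfiles alone would not apply.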


On Sat, Jun 12, 2010 at 11:22 PM, Sammy Yu wrote:
Hi,
I'm running the latest version of trunk, r953172. I'm doing a dynamic partition insert overwrite query which generates a lot of small files in each of the partitions. I was hoping this could be solved by setting hive.merge.mapredfiles to true. However, it seems that whenever the job is submitted the property is always set to false, so it doesn't have any effect. I also tried to modify this property in hive-default.xml, but that didn't work either.

Thanks,
Sammy
