Grokbase Groups Hive user June 2010
I'm sure Ning will appreciate any help you can give, so if you make progress, feel free to upload an updated patch.

JVS
On Jul 2, 2010, at 4:44 PM, Viraj Bhat wrote:

Hi John,
Thanks again for letting me know. This can be overcome by using
the CombineInputFormat, though unfortunately I am not using that branch ;)
Also, a large number of small files in some partitions leads to poor
utilization of the NameNode.
Please let me know if you need help with the patch.
Thanks
Viraj

-----Original Message-----
From: John Sichi
Sent: Thursday, July 01, 2010 11:57 PM
To: hive-user@hadoop.apache.org
Subject: RE: merging the size of the reduce output

Ning is currently out on vacation; I think he'll be back to working on
this when he returns.

JVS

________________________________________
From: Viraj Bhat [viraj@yahoo-inc.com]
Sent: Thursday, July 01, 2010 11:40 PM
To: hive-user@hadoop.apache.org
Subject: RE: merging the size of the reduce output

Okay, I read that there is work in progress
(https://issues.apache.org/jira/browse/HIVE-1307) to deal with small files
when doing dynamic partitioning.
There was a suggestion to try:
hive.input.format=org.apache.hadoop.hive.ql.io.CombineHiveInputFormat
for Hadoop 20 when running queries on this partition.
Viraj

________________________________
From: Viraj Bhat
Sent: Thursday, July 01, 2010 11:31 PM
To: hive-user@hadoop.apache.org
Cc: athusoo@facebook.com
Subject: RE: merging the size of the reduce output

Hi Yongqiang,
I am facing a similar situation. I am using the latest trunk of Hive
with dynamic partitioning, and it is a map-only job that
converts files from gzip-compressed text to RC format.
The query for the task looks similar to:

FROM gztable
INSERT OVERWRITE TABLE rctable
...
PARTITION(datestamp, partitionlevel1, partitionlevel1)
SELECT ...
..
set hive.merge.mapredfiles=true;
set hive.merge.mapfiles=true;
set hive.merge.smallfiles.avgsize=256000000;
set hive.merge.size.smallfiles.avgsize=256000000;

When I run a job, I see that the following are set to false in the
job.xml when the job starts up.
hive.merge.mapfiles = false;
hive.merge.mapredfiles = false;

Is this a bug with dynamic partitioning? Is there something else I need
to set to get this to work and remove the small files I might be generating?

Viraj

________________________________
From: Yongqiang He
Sent: Sunday, June 13, 2010 10:56 PM
To: hive-user@hadoop.apache.org
Subject: Re: merging the size of the reduce output

I think there is another parameter, "hive.merge.smallfiles.avgsize", that
decides whether to run the merge job based on the average size of the
output files. The default for that parameter is 16M, so if the average
output file size is larger than 16M, Hive will not merge.
Maybe you can try increasing that value and see.
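
As a rough illustration of the rule described here, the following Python sketch models the decision. It is not taken from the Hive source: the function name `should_merge` is hypothetical, and "16M" is assumed to mean 16,000,000 bytes.

```python
from typing import Sequence

# Assumption: the "16M" default mentioned above is taken as 16,000,000 bytes.
DEFAULT_AVG_SIZE_THRESHOLD = 16_000_000


def should_merge(file_sizes: Sequence[int],
                 avg_size_threshold: int = DEFAULT_AVG_SIZE_THRESHOLD) -> bool:
    """Return True when the average output file size is below the threshold.

    Hypothetical model of the hive.merge.smallfiles.avgsize check as
    described in this thread; the real check in Hive may differ in detail.
    """
    if not file_sizes:
        return False  # no output files, so nothing to merge
    average = sum(file_sizes) / len(file_sizes)
    return average < avg_size_threshold
```

Under these assumptions, output files averaging 20,000,000 bytes would not be merged with the default threshold, but would be once the threshold is raised to 256000000.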

Thanks
Yongqiang
On 6/13/10 10:41 PM, "Sammy Yu" wrote:
Hi,
I have both hive.merge.mapfiles and hive.merge.mapredfiles set to
true via the shell tool and the hive-default.xml configuration file.
However, it appears somehow the job configuration is changed before the
job is submitted. Is there another condition that can cause this to
happen?

Thanks,
Sammy


On Sun, Jun 13, 2010 at 7:39 AM, Ted Yu wrote:
Looking at
ql/src/java/org/apache/hadoop/hive/ql/optimizer/GenMRFileSink1.java,
hive.merge.mapredfiles is effective if there is a reducer for your job.
Otherwise you should have set hive.merge.mapfiles to true.
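
The rule above can be sketched in Python as follows. This is only a model of the behavior as described in this message, not a transcription of GenMRFileSink1.java: the helper name `merge_enabled` is hypothetical, and treating an unset flag as "false" is an assumption of the sketch rather than Hive's actual default.

```python
def merge_enabled(has_reducer: bool, conf: dict) -> bool:
    """Return whether the merge flag relevant to this job is switched on.

    Jobs with a reducer consult hive.merge.mapredfiles; map-only jobs
    consult hive.merge.mapfiles, as described in the message above.
    """
    key = "hive.merge.mapredfiles" if has_reducer else "hive.merge.mapfiles"
    # Assumption: a missing flag counts as "false" in this sketch.
    return str(conf.get(key, "false")).lower() == "true"
```

In this model, a map-only job with only hive.merge.mapredfiles set to true would not trigger a merge, which matches the advice to set hive.merge.mapfiles instead.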


On Sat, Jun 12, 2010 at 11:22 PM, Sammy Yu wrote:
Hi,
I'm running the latest version of trunk, r953172. I'm doing a
dynamic partition insert overwrite query which generates a lot of small
files in each of the partitions. I was hoping this could be solved by
setting hive.merge.mapredfiles to true. However, it seems that whenever
the job is submitted the property is always set to false, so it doesn't
seem to have any effect. I also tried to modify this property in
hive-default.xml, but that didn't work either.

Thanks,
Sammy

Discussion Overview
group: user @
categories: hive, hadoop
posted: Jun 13, '10 at 6:23a
active: Jul 7, '10 at 1:52a
posts: 10
users: 5
website: hive.apache.org
