[ https://issues.apache.org/jira/browse/HADOOP-4927?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Doug Cutting updated HADOOP-4927:

Issue Type: New Feature (was: Bug)

Changing this from a bug to a feature request. It seems reasonable for FileOutputFormat to support a mode where files are created lazily when the first record is written.
Out of 30 million files/dirs, 4.5 million part- files were empty. 40 users have more than 10,000 empty files.
It sounds like there may be another problem here as well. Are these folks specifying way too many reduces? For jobs with lots of empty files, how many non-empty files are there, and how big are they?

Part files on the output filesystem are created irrespective of whether the corresponding task has anything to write there

Key: HADOOP-4927
URL: https://issues.apache.org/jira/browse/HADOOP-4927
Project: Hadoop Core
Issue Type: New Feature
Components: mapred
Reporter: Devaraj Das
Fix For: 0.20.0

When OutputFormat.getRecordWriter is invoked, a part file is created on the output filesystem. However, the resulting RecordWriter is not used until the task (the user's code) calls OutputCollector.collect. As a result, a task that never invokes OutputCollector.collect still leaves behind an empty part file.
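
The behaviour described above, and the lazy-creation mode suggested in the comment, can be illustrated with a minimal standalone sketch. It does not use Hadoop's classes: SimpleRecordWriter, EagerFileWriter, LazyFileWriter and LazyWriterDemo are hypothetical names invented for this illustration, and the local filesystem stands in for the output filesystem. The eager writer creates its part file as soon as it is constructed; the lazy wrapper only constructs the eager writer when the first record arrives.

```java
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

public class LazyWriterDemo {

    /** Hypothetical stand-in for a record writer; not Hadoop's RecordWriter. */
    interface SimpleRecordWriter extends AutoCloseable {
        void write(String key, String value) throws IOException;

        @Override
        void close() throws IOException;
    }

    /** Eager writer: the part file is created in the constructor. */
    static final class EagerFileWriter implements SimpleRecordWriter {
        private final BufferedWriter out;

        EagerFileWriter(Path partFile) throws IOException {
            // Opening the writer creates the file, even if nothing is written.
            this.out = Files.newBufferedWriter(partFile);
        }

        @Override
        public void write(String key, String value) throws IOException {
            out.write(key + "\t" + value);
            out.newLine();
        }

        @Override
        public void close() throws IOException {
            out.close();
        }
    }

    /** Lazy wrapper: defers creating the real writer until the first record. */
    static final class LazyFileWriter implements SimpleRecordWriter {
        private final Path partFile;
        private SimpleRecordWriter delegate; // stays null until the first write

        LazyFileWriter(Path partFile) {
            this.partFile = partFile;
        }

        @Override
        public void write(String key, String value) throws IOException {
            if (delegate == null) {
                delegate = new EagerFileWriter(partFile);
            }
            delegate.write(key, value);
        }

        @Override
        public void close() throws IOException {
            // Nothing to close, and no file on disk, if no record was written.
            if (delegate != null) {
                delegate.close();
            }
        }
    }

    public static void main(String[] args) throws IOException {
        Path dir = Files.createTempDirectory("lazy-demo");
        Path idle = dir.resolve("part-00000");
        Path busy = dir.resolve("part-00001");

        // A task that emits nothing: the writer is opened and closed only.
        new LazyFileWriter(idle).close();

        // A task that emits one record.
        try (SimpleRecordWriter w = new LazyFileWriter(busy)) {
            w.write("key", "value");
        }

        System.out.println("idle task created a file: " + Files.exists(idle));
        System.out.println("busy task created a file: " + Files.exists(busy));
    }
}
```

Swapping LazyFileWriter for EagerFileWriter in the idle case reproduces the empty part file reported in this issue. Whether a real fix belongs in FileOutputFormat itself or in a wrapper around it is a design question this sketch does not settle.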
This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.

Discussion Overview
group: common-dev @
posted: Dec 22, '08 at 6:01a
active: Feb 23, '09 at 3:19p
