We have actually written a custom multi-file splitter that packs the small
files into a single split until the DFS block size is reached.
We also handle big files by splitting them on the block size and combining
the remainders (if any) into a single split.

It works great for us....:-)
We are working on optimizing it further to group the small files that live on
the same data node into one split, so that each Map task gets as much local data as possible.

We plan to share this (provided it's found acceptable, of course) once
this is done.
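For readers who want the flavor of that packing step, here is a minimal sketch of the bin-packing logic against the Hadoop FileSystem API. This is not the splitter described above; SmallFilePacker and packFiles are illustrative names, and big-file handling is simplified.

    import java.io.IOException;
    import java.util.ArrayList;
    import java.util.List;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    // Hypothetical helper: groups small files into "bins" whose total size
    // stays under the DFS block size, so each bin can back one input split.
    public class SmallFilePacker {

      public static List<List<FileStatus>> packFiles(Configuration conf, Path inputDir)
          throws IOException {
        FileSystem fs = FileSystem.get(conf);
        long blockSize = fs.getDefaultBlockSize();   // pack up to one DFS block per split

        List<List<FileStatus>> bins = new ArrayList<List<FileStatus>>();
        List<FileStatus> current = new ArrayList<FileStatus>();
        long currentSize = 0;

        for (FileStatus file : fs.listStatus(inputDir)) {
          long len = file.getLen();
          if (len >= blockSize) {
            // Big files would be split on block boundaries elsewhere;
            // here each one simply gets its own bin for simplicity.
            bins.add(java.util.Collections.singletonList(file));
            continue;
          }
          if (currentSize + len > blockSize && !current.isEmpty()) {
            bins.add(current);                       // close the full bin
            current = new ArrayList<FileStatus>();
            currentSize = 0;
          }
          current.add(file);
          currentSize += len;
        }
        if (!current.isEmpty()) {
          bins.add(current);
        }
        return bins;
      }
    }

Each resulting bin would then be wrapped in a multi-file input split so that one map task reads the whole group.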

Regards,
Subru

Stuart Sierra wrote:
Thanks for the advice, everyone. I'm going to go with #2, packing my
million files into a small number of SequenceFiles. This is slow, but
only has to be done once. My "datacenter" is Amazon Web Services :),
so storing a few large, compressed files is the easiest way to go.

My code, if anyone's interested, is here:
http://stuartsierra.com/2008/04/24/a-million-little-files
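The gist of that approach, sketched minimally with the stock SequenceFile writer API (this is not the linked code; PackToSequenceFile is an illustrative name, and the input directory and output path come from command-line arguments):

    import java.io.File;
    import java.io.IOException;
    import java.nio.file.Files;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.BytesWritable;
    import org.apache.hadoop.io.SequenceFile;
    import org.apache.hadoop.io.Text;

    // Pack a local directory of small files into one BLOCK-compressed
    // SequenceFile, using the file name as the key and the raw bytes as the value.
    public class PackToSequenceFile {
      public static void main(String[] args) throws IOException {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        Path out = new Path(args[1]);

        SequenceFile.Writer writer = SequenceFile.createWriter(
            fs, conf, out, Text.class, BytesWritable.class,
            SequenceFile.CompressionType.BLOCK);
        try {
          for (File f : new File(args[0]).listFiles()) {
            byte[] data = Files.readAllBytes(f.toPath());
            writer.append(new Text(f.getName()), new BytesWritable(data));
          }
        } finally {
          writer.close();
        }
      }
    }

BLOCK compression compresses keys and values in batches rather than per record, which is what keeps a SequenceFile full of small records reasonably compact.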

-Stuart
altlaw.org

On Wed, Apr 23, 2008 at 11:55 AM, Stuart Sierra wrote:

Hello all, Hadoop newbie here, asking: what's the preferred way to
handle large (~1 million) collections of small files (10 to 100KB) in
which each file is a single "record"?

1. Ignore it, let Hadoop create a million Map processes;
2. Pack all the files into a single SequenceFile; or
3. Something else?

I started writing code to do #2, transforming a big tar.bz2 into a
BLOCK-compressed SequenceFile, with the file names as keys. Will that
work?

Thanks,
-Stuart, altlaw.org
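For context on option #2: when the job reads such a SequenceFile back, each (file name, file bytes) pair reaches the mapper as one record. A minimal sketch of the job setup with the org.apache.hadoop.mapred API of the time, where IdentityMapper stands in for a real mapper:

    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.BytesWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.FileInputFormat;
    import org.apache.hadoop.mapred.FileOutputFormat;
    import org.apache.hadoop.mapred.JobClient;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.SequenceFileInputFormat;
    import org.apache.hadoop.mapred.lib.IdentityMapper;

    // Minimal job setup: each (file name, file bytes) record in the
    // SequenceFile is delivered to the mapper as a single input record.
    public class SmallRecordsJob {
      public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(SmallRecordsJob.class);
        conf.setJobName("small-records");

        conf.setInputFormat(SequenceFileInputFormat.class);
        FileInputFormat.setInputPaths(conf, new Path(args[0]));
        FileOutputFormat.setOutputPath(conf, new Path(args[1]));

        // A real job would plug in its own mapper with Text keys
        // and BytesWritable values; IdentityMapper is a placeholder.
        conf.setMapperClass(IdentityMapper.class);
        conf.setOutputKeyClass(Text.class);
        conf.setOutputValueClass(BytesWritable.class);

        JobClient.runJob(conf);
      }
    }

Any mapper declared with Text keys and BytesWritable values would then see one whole file per call to map().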
