One can run MR jobs on compressed files stored on the Hadoop DFS,
without needing to decompress chunks out of them manually. But it
helps to know what kind of compression algorithm is being applied.
For instance, large GZip or DEFLATE files can't be 'split' across
many mappers; but BZip2 (and LZO, with block-index preparation) files
can be split into blocks, each of which may be assigned its own
mapper, thus parallelizing your execution.
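To see why a plain GZip file can't be split, here is a toy illustration (pure Python, nothing Hadoop-specific): a reader dropped into the middle of a gzip stream — which is exactly what a second mapper would be — has no way to resynchronise, because DEFLATE data only makes sense when decoded from the start.

```python
import gzip
import zlib

# One day's worth of "log" lines, gzipped as a single stream.
data = b"\n".join(b"flow record %d" % i for i in range(1000))
blob = gzip.compress(data)

def can_start_at(offset):
    """Can a reader begin gzip decompression at this byte offset?"""
    try:
        # wbits=31 tells zlib to expect a gzip-wrapped stream.
        zlib.decompressobj(wbits=31).decompress(blob[offset:])
        return True
    except zlib.error:
        return False

# From offset 0 this works fine; from the middle of the file (where a
# second mapper's input split would begin) it fails, so the whole file
# has to go to a single mapper.
```

BZip2's on-disk format, by contrast, is a sequence of independently decodable blocks, which is what makes per-block splits possible.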
IMO, a proper way to concatenate lots of 'small' files (say, hourly)
into one large file for efficient Hadoop MR execution on them would be
to use the provided SequenceFile format. It's a key-value storage and
can hold all your data in one file as <hour-filename-string (key),
collected-data (value)> (for example).
Then you could use SequenceFileInputFormat, break each large value
chunk down into lines and so on, and emit key-value pairs as you
need. The advantage here is that you can also compress these files
per BLOCK (sequences of values are compressed together as a block) or
per RECORD (each value is compressed independently). You can also
specify which compression codec to use (GZip, BZip2 and LZO are some
of the available ones).
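In a 0.20-era job configuration, those compression knobs might look something like this (property names from the old `mapred` API — an assumption on my part, so check the docs for your Hadoop version):

```xml
<property>
  <name>mapred.output.compress</name>
  <value>true</value>
</property>
<property>
  <name>mapred.output.compression.type</name>
  <!-- BLOCK or RECORD, as described above -->
  <value>BLOCK</value>
</property>
<property>
  <name>mapred.output.compression.codec</name>
  <value>org.apache.hadoop.io.compress.BZip2Codec</value>
</property>
```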
API reference: http://hadoop.apache.org/common/docs/current/api/org/apache/hadoop/io/SequenceFile.html
Doing it this way also gives you your 'raw ASCII data' in its usual
form inside your mapper (say, as Text or LongWritable, etc.). The
decompression is performed by the input format and record readers
themselves, abstracted away from the user.
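The per-record line-splitting step described above can be sketched in Python, to match the streaming setup described below (the record layout and filename are hypothetical):

```python
def split_record(key, value):
    """Break one <hour-filename (key), collected-data (value)> record
    into per-line key/value pairs, the way the mapper described above
    would, keeping the originating filename as the key."""
    for line in value.splitlines():
        if line:  # skip empty lines
            yield key, line

# Toy input: one SequenceFile record holding an hour's worth of lines.
pairs = list(split_record("2010-08-06.00.log", "flow 1\nflow 2\n"))
```

From here each pair can be emitted as a tab-separated line, which is exactly the shape Hadoop Streaming expects on stdout.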
On Fri, Aug 6, 2010 at 12:25 PM, fred smith wrote:
I am playing with netflow data on my small Hadoop cluster (20 nodes),
just trying things out. I am a beginner with Hadoop, so please be gentle.
I am currently running map reduce jobs on text (i.e. formatted) netflow
files. They have already been processed with flow-tools
(http://code.google.com/p/flow-tools/). I use streaming and Python
rather than coding in Java, and it all works OK.
The issue I am facing is performance. I can concatenate one day's
formatted logs into a single file, which comes to about 18GB. So 18GB
per day will be around 6.5TB of files per year.
But it takes a long time to do this, and the result is slow to process
afterwards.
The original data is heavily compressed - flow-tools is extremely
efficient at that! I am trying to work out whether I can do anything
with the binary datasets directly, to save space and hopefully get
better performance as well.
flow-tool file ----> flow-cat & flow-print -----> formatted text file
3GB binary ------------------------------------------> 18GB ASCII
The problem is that I don't think I can get the binary files processed
efficiently because of the compression, can I? I can't split the
binary file and have it processed in parallel? Sorry for such a basic
question, but I am having trouble imagining how map reduce will work
with binary files.
Should I just forget about binary, and stop worrying about it being
6.5TB, as that isn't a lot in Hadoop terms!