Ah yeah, I forgot to mention that part. Our raw logs are compressed as well, but I'm pushing them frequently enough that there are always enough splits/maps to saturate the nodes.

I heard some discussion a little while back about development of a bz2/zip codec, which would be splittable (see http://www.nabble.com/Compression-using-Hadoop...-tf4354954.html#a12432166 ). But I don't know how much good that would do me... I really need to be able to compress in an 'online' fashion, which seems more difficult to achieve with bzip2.

So repeatedly reading the raw logs is out, due to their being compressed, but also because it is a very small number of events that aren't emitted on the first go round.

Any ideas?


-----Original Message-----
From: Ted Dunning <tdunning@veoh.com>
Sent: Sunday, September 30, 2007 7:19pm
To: hadoop-user@lucene.apache.org
Subject: Re: InputFormat for Two Types

Depending on how you store your raw log data, it might or might not be
suitable for repeated reading. In my case, I have a situation very much
like yours, but my log files are encrypted and compressed using a stream
compression. That means that I can't split those files, which is a real
pity because any processing job on less than a few days of data takes the
same amount of time. I would LOVE it if processing an hour of data took a
LOT less time than processing 2 days of data.

As a result, I am looking at converting all those logs whenever they are in
the HDFS. What I would particularly like is a good compressed format that
handles lots of missing data well (tab-delimited does this well because of
the heavy compression of repeated tabs), but I want to be able to split
input files. TextInputFormat, unfortunately, has this test in it:

    protected boolean isSplitable(FileSystem fs, Path file) {
      return compressionCodecs.getCodec(file) == null;
    }

This seems to indicate that textual files cannot be both compressed and split.

On the other hand, SequenceFiles are splittable, but it isn't clear how well
they will handle missing or empty fields. That is my next experiment.

On 9/30/07 3:33 PM, "Stu Hood" wrote:


I need to write a mapreduce program that begins with 2 jobs:
1. Convert raw log data to SequenceFiles
2. Read from SequenceFiles, and cherry pick completed events
(otherwise, keep them as SequenceFiles to be checked again later)
But I should be able to compact those 2 jobs into 1 job.

I just need to figure out how to write an InputFormat that uses 2 types of
RecordReaders, depending on the input file type. Specifically, the inputs
would be either raw log data (TextInputFormat), or partially processed log
data (SequenceFileInputFormat).

I think I need to extend SequenceFileInputFormat to look for an identifying
extension on the files. Then I would be able to return either a
LineRecordReader or a SequenceFileRecordReader, and some logic in Map could
process the line into a record.

Am I headed in the right direction? Or should I stick with running 2 jobs
instead of trying to squash these steps into 1?
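
The extension-based dispatch described above could look roughly like the sketch below. It is plain Java with no Hadoop classes, so it only models the decision itself. `ReaderKind`, `ReaderChooser`, and the `.seq` extension are hypothetical names chosen for illustration; in a real job the two branches would presumably construct a LineRecordReader or a SequenceFileRecordReader.

```java
// Hypothetical stand-in for the two record readers mentioned above.
enum ReaderKind { LINE, SEQUENCE_FILE }

final class ReaderChooser {
    // Assumed naming convention: partially processed files end in ".seq".
    static final String SEQ_EXTENSION = ".seq";

    private ReaderChooser() {}

    // Decide which kind of reader to use from the file name alone.
    static ReaderKind choose(String fileName) {
        return fileName.endsWith(SEQ_EXTENSION)
                ? ReaderKind.SEQUENCE_FILE
                : ReaderKind.LINE;
    }
}
```

With this convention, `ReaderChooser.choose("events-0001.seq")` selects the SequenceFile path and anything else (for example `access.log.gz`) is treated as raw text. The Map would still have to cope with both kinds of input record.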


Stu Hood


  • Ted Dunning at Oct 1, 2007 at 2:42 am
    Sorry, I think I said something confusing.

    Repeatedly reading is inefficient in my case because of the cost of
    decryption and log line parsing. Compression is usually GOOD in these cases
    because you are effectively multiplying the disk read rate by the
    compression rate (possibly 10 or 20x for log files) at the relatively
    moderate cost of some CPU cycles.
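
As a back-of-the-envelope sketch of that multiplication: the `EffectiveReadRate` helper and every number used with it are made up for illustration, and the cap at the decompression rate is an assumption about where the CPU cost would show up, not a measurement.

```java
final class EffectiveReadRate {
    private EffectiveReadRate() {}

    // Rate at which uncompressed log data becomes available, in MB/s.
    // diskMBps:       raw read rate of the compressed bytes
    // ratio:          compression ratio (e.g. 10 means 10x smaller on disk)
    // decompressMBps: assumed ceiling on uncompressed output from the CPU
    static double logicalMBps(double diskMBps, double ratio, double decompressMBps) {
        return Math.min(diskMBps * ratio, decompressMBps);
    }
}
```

For example, a hypothetical 50 MB/s disk with 10x compression delivers 500 MB/s of log data as long as decompression can keep up; if the CPU tops out at 300 MB/s of output, that becomes the limit instead.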

    The reason for changing to a different compression type in my case is so
    that files can be sub-divided. This has two benefits. The obvious benefit
    is higher potential parallelism while still keeping the file size large.
    This is less important if you are rolling your files often as you say. The
    second, less obvious benefit is that you have more efficient load balancing
    if you can divide your input into 3-5 times more pieces than you have task
    nodes. This happens because faster nodes can munch on more pieces than the
    slower nodes. If you have absolutely uniform tasks and absolutely uniform
    nodes, then this won't help, but I can't help thinking that with log files
    rotated by time, you will have at least 2x variation. That means that a
    significant number of log files will be much shorter tasks than others and
    many nodes will go idle in the last half of the map phase. If you combine
    this with 2x variation in speed due to task location and machine calibre,
    you could have considerable slack in the last 75% of the map phase. Not
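
The load-balancing argument above can be checked with a small simulation. This is only a sketch under simplifying assumptions: all pieces are the same size, each piece goes to whichever node frees up first, and only node speed varies. `SplitBalance` and the speeds used with it are hypothetical.

```java
final class SplitBalance {
    private SplitBalance() {}

    // Time until the last node finishes when `pieces` equal-sized splits of
    // `totalWork` are handed out one by one to the node that is free earliest.
    static double makespan(double[] nodeSpeeds, double totalWork, int pieces) {
        double[] freeAt = new double[nodeSpeeds.length];
        double pieceWork = totalWork / pieces;
        for (int p = 0; p < pieces; p++) {
            // Find the node that becomes idle first (lowest index wins ties).
            int next = 0;
            for (int n = 1; n < freeAt.length; n++) {
                if (freeAt[n] < freeAt[next]) {
                    next = n;
                }
            }
            // That node processes the piece at its own speed.
            freeAt[next] += pieceWork / nodeSpeeds[next];
        }
        double finish = 0.0;
        for (double t : freeAt) {
            finish = Math.max(finish, t);
        }
        return finish;
    }
}
```

With four nodes at speeds {1, 1, 2, 2} and 12 units of work, one piece per node finishes at time 3.0 because the fast nodes sit idle after 1.5. Cutting the same work into 16 pieces finishes at 2.25, much closer to the ideal of 2.0.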

    On 9/30/07 5:50 PM, "Stu Hood" wrote:

    So repeatedly reading the raw logs is out, due to their being compressed, but
    also because it is a very small number of events that aren't emitted on the
    first go round.
