Hi,

I am playing with netflow data on my small hadoop cluster (20 nodes)
just trying things out. I am a beginner on hadoop so please be gentle
with me.

I am currently running MapReduce jobs on text (i.e. formatted) netflow
files. They have already been processed with flow-tools
(http://code.google.com/p/flow-tools/). I use Hadoop Streaming and
Python rather than coding in Java, and it all works OK.

The issue I am facing is performance. I can concatenate one day's
formatted logs into a single file, which comes to about 18GB in size.
So, at 18GB per day, that will be around 6.5TB of files per year. But
it takes a long time to do this, and the result is slow to process
afterwards.

The original data is heavily compressed - flow-tools is extremely
efficient at that! I am trying to work out whether I can do anything
with the binary datasets directly, to save space and hopefully get
better performance as well.

Currently:
flow-tools file ----> flow-cat & flow-print -----> formatted text file
3GB binary      ---------------------------------> 18GB ASCII

The problem is that I don't think I can get the binary files processed
efficiently because of the compression, can I? I can't split a binary
file and have it processed in parallel, can I? Sorry for such a basic
question, but I am having trouble imagining how MapReduce will work
with binary files.

Should I just forget about binary and stop worrying about it being
6.5TB, as that isn't a lot in Hadoop terms?

Thanks
Paul


  • Harsh J at Aug 6, 2010 at 7:42 am
    Hello,

    One can run MR jobs on compressed files stored on the Hadoop DFS
    without needing to decompress them manually, but there are certain
    exceptions.

    It'd help to know what kind of compression algorithm is being applied.
    For instance, large GZip or DEFLATE files can't be 'split' across many
    mappers, but BZip2 (and LZO, with block-index preparation) files can
    be split into blocks, each of which may be assigned its own mapper,
    thus parallelizing your execution.

    IMO, a proper way to concatenate lots of 'small' files (say, hourly)
    into one large file for efficient Hadoop MR execution would be to use
    the provided SequenceFile format. It's a key-value storage format and
    can hold all your data in one file as, for example,
    <hour-filename-string (key), collected-data (value)>.

    Then you could use SequenceFileInputFormat, break your large value
    chunk down into lines, and emit each key-value pair as you need. The
    advantage here is that you can also compress these files per BLOCK
    (sequences of values are compressed together as a block) or per
    RECORD (each value is compressed independently). You can also specify
    which compression codec to use (GZip, BZip2 and LZO are some of the
    available ones).
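
    For instance, here is a minimal sketch of packing a day's worth of
    hourly flow-print text files into one block-compressed SequenceFile
    (the local and HDFS paths below are made up for illustration):

    import java.io.File;
    import java.nio.charset.Charset;
    import java.nio.file.Files;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IOUtils;
    import org.apache.hadoop.io.SequenceFile;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.io.compress.GzipCodec;

    public class PackHourlyFlows {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        Path out = new Path("/flows/2010-08-06.seq"); // hypothetical HDFS path

        // BLOCK compression squeezes runs of values together; the codec is
        // a free choice because a SequenceFile stays splittable either way.
        SequenceFile.Writer writer = SequenceFile.createWriter(
            fs, conf, out, Text.class, Text.class,
            SequenceFile.CompressionType.BLOCK, new GzipCodec());
        try {
          for (File hourly : new File("/data/flows/2010-08-06").listFiles()) {
            String contents = new String(
                Files.readAllBytes(hourly.toPath()), Charset.forName("US-ASCII"));
            // <hour-filename (key), collected flow text (value)>
            writer.append(new Text(hourly.getName()), new Text(contents));
          }
        } finally {
          IOUtils.closeStream(writer);
        }
      }
    }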

    API reference: http://hadoop.apache.org/common/docs/current/api/org/apache/hadoop/io/SequenceFile.html
    More reading: http://www.cloudera.com/blog/2009/02/the-small-files-problem/

    Doing it this way also gives you your raw ASCII data in its original
    form inside your mapper (say, as Text or LongWritable values). The
    decompression is performed by the input format and record readers
    themselves, abstracted away from the user.
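
    On the map side that could look like the rough sketch below
    (hypothetical class name, old "mapred" API); the framework has
    already decompressed each record by the time map() sees it.

    import java.io.IOException;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.MapReduceBase;
    import org.apache.hadoop.mapred.Mapper;
    import org.apache.hadoop.mapred.OutputCollector;
    import org.apache.hadoop.mapred.Reporter;

    public class FlowLineMapper extends MapReduceBase
        implements Mapper<Text, Text, Text, Text> {
      public void map(Text hourFileName, Text contents,
                      OutputCollector<Text, Text> out, Reporter reporter)
          throws IOException {
        // Break the packed hour's worth of flow-print output back into lines.
        for (String line : contents.toString().split("\n")) {
          out.collect(hourFileName, new Text(line));
        }
      }
      // Job wiring elsewhere:
      // conf.setInputFormat(org.apache.hadoop.mapred.SequenceFileInputFormat.class);
    }
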
    --
    Harsh J
    www.harshj.com
