Hi Folks,

I have a bunch of binary files which I've stored in a sequencefile.
The name of the file is the key, the data is the value and I've stored
them sorted by key. (I'm not tied to using a sequencefile for this).
The current test data is only 50MB, but the real data will be 500MB -
1GB.

My M/R job requires that its input be several of these records from the
sequence file, and which records belong together is determined by the
key. The sorting mentioned above keeps these all packed together.

1. Any reason not to use a sequence file for this? Perhaps a mapfile?
Since I've sorted it, I don't need "random" accesses, but I do need
to be aware of the keys, as I need to be sure that I get all of the
relevant keys sent to a given mapper

2. Looks like I want a custom inputformat for this, extending
SequenceFileInputFormat. Do you agree? I'll gladly take some
opinions on this, as I ultimately want to split the file based on
what's in it, which might be a little unorthodox.

3. Another idea might be to create separate seq files for each chunk of
records and make them non-splittable, ensuring that they go to a
single mapper. Assuming I can get away with this, see any pros/cons
with that approach?

Thanks,

Tom

--
===================
Skybox is hiring.
http://www.skyboximaging.com/careers/jobs


  • Joey Echeverria at Jul 27, 2011 at 12:43 pm

    > 1. Any reason not to use a sequence file for this? Perhaps a mapfile?
    > Since I've sorted it, I don't need "random" accesses, but I do need
    > to be aware of the keys, as I need to be sure that I get all of the
    > relevant keys sent to a given mapper

    MapFile *may* be better here (see my answer for 2 below).

    > 2. Looks like I want a custom inputformat for this, extending
    > SequenceFileInputFormat. Do you agree? I'll gladly take some
    > opinions on this, as I ultimately want to split the file based on
    > what's in it, which might be a little unorthodox.

    If you need to split based on where certain keys are in the file, then
    a SequenceFile isn't a great solution. It would require that your
    InputFormat scan through all of the data just to find split points.
    Assuming you know what keys to split on ahead of time, you could use
    MapFiles and find the exact split point more quickly.

    > 3. Another idea might be to create separate seq files for each chunk of
    > records and make them non-splittable, ensuring that they go to a
    > single mapper. Assuming I can get away with this, see any pros/cons
    > with that approach?

    Separate sequence files would require the least amount of custom code.

    -Joey

    --
    Joseph Echeverria
    Cloudera, Inc.
    443.305.9434
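
To illustrate the point in answer 2 above: a MapFile keeps a sorted data file plus a sparse index of keys and byte offsets, so a reader can jump close to a known key instead of scanning from the start. The sketch below is plain Java, not the Hadoop MapFile API. The TreeMap, the key names, and the offsets are all made-up stand-ins for that index.

```java
import java.util.Map;
import java.util.TreeMap;

class SparseIndexSketch {
    /**
     * Returns the offset stored for the closest indexed key that is
     * less than or equal to wantedKey, or 0 if no such key is indexed.
     * A reader would seek there and scan forward to the exact record.
     */
    static long seekOffset(TreeMap<String, Long> index, String wantedKey) {
        Map.Entry<String, Long> entry = index.floorEntry(wantedKey);
        return entry == null ? 0L : entry.getValue();
    }

    public static void main(String[] args) {
        // Hypothetical sparse index: only some keys are listed.
        TreeMap<String, Long> index = new TreeMap<>();
        index.put("scene-001/a", 0L);
        index.put("scene-004/a", 4096L);
        index.put("scene-009/a", 12288L);

        // "scene-005/a" is not indexed; the nearest earlier key is scene-004/a.
        System.out.println(seekOffset(index, "scene-005/a")); // prints 4096
    }
}
```

With a plain SequenceFile there is no such index, which is why finding a split point by key means reading through the records.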
  • Tom Melendez at Jul 27, 2011 at 6:09 pm

    >> 3. Another idea might be to create separate seq files for each chunk of
    >> records and make them non-splittable, ensuring that they go to a
    >> single mapper. Assuming I can get away with this, see any pros/cons
    >> with that approach?
    > Separate sequence files would require the least amount of custom code.

    Thanks for the response, Joey.

    So, if I were to do the above, I would still need a custom record
    reader to put all the keys and values together, right?

    Thanks,

    Tom

  • Joey Echeverria at Jul 27, 2011 at 8:41 pm
    You could either use a custom RecordReader or you could override the
    run() method on your Mapper class to do the merging before calling the
    map() method.

    -Joey
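
A rough sketch of the merging step described above, in plain Java so it runs without Hadoop. In a real job this loop would sit inside an overridden Mapper.run(Context), iterating with context.nextKeyValue(), or inside a custom RecordReader. The class name, the sample keys, and the rule that the group id is the text before the first underscore are assumptions; the thread does not say how keys map to groups. Values are left out for brevity and would travel with their keys.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Function;

class GroupRunsSketch {
    /**
     * Collects consecutive keys that share a group id into one list.
     * Relies on the input being sorted so that each group is contiguous.
     */
    static List<List<String>> groupRuns(List<String> sortedKeys,
                                        Function<String, String> groupOf) {
        List<List<String>> groups = new ArrayList<>();
        List<String> current = new ArrayList<>();
        String currentGroup = null;
        for (String key : sortedKeys) {
            String group = groupOf.apply(key);
            // A change of group id closes the run collected so far.
            if (currentGroup != null && !group.equals(currentGroup)) {
                groups.add(current);
                current = new ArrayList<>();
            }
            currentGroup = group;
            current.add(key);
        }
        if (!current.isEmpty()) {
            groups.add(current); // flush the final run
        }
        return groups;
    }

    public static void main(String[] args) {
        List<String> keys = Arrays.asList("a_1.bin", "a_2.bin", "b_1.bin");
        // Hypothetical rule: the group id is the text before the first underscore.
        System.out.println(groupRuns(keys, k -> k.substring(0, k.indexOf('_'))));
        // prints [[a_1.bin, a_2.bin], [b_1.bin]]
    }
}
```

Each inner list corresponds to what would be handed to a single map() call once the whole group has been read.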

Discussion Overview
group: common-user @
categories: hadoop
posted: Jul 27, '11 at 6:29a
active: Jul 27, '11 at 8:41p
posts: 4
users: 2
website: hadoop.apache.org...
irc: #hadoop
