Hi Folks,
I have a bunch of binary files which I've stored in a sequencefile.
The name of the file is the key, the data is the value, and I've stored
them sorted by key. (I'm not tied to using a sequencefile for this).
The current test data is only 50MB, but the real data will be 500MB -
1GB.
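
For reference, the packing step looks roughly like this. It's an
untested sketch, assuming Text keys, BytesWritable values, and the
old SequenceFile.createWriter(fs, conf, path, keyClass, valClass)
call:

import java.io.File;
import java.io.IOException;
import java.nio.file.Files;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class PackFiles {
  public static void main(String[] args) throws IOException {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    Path out = new Path(args[0]);

    // Text key = file name, BytesWritable value = file contents.
    SequenceFile.Writer writer =
        SequenceFile.createWriter(fs, conf, out, Text.class, BytesWritable.class);
    try {
      // Remaining args are local file paths, already sorted by name.
      for (int i = 1; i < args.length; i++) {
        byte[] data = Files.readAllBytes(new File(args[i]).toPath());
        writer.append(new Text(args[i]), new BytesWritable(data));
      }
    } finally {
      writer.close();
    }
  }
}
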
My M/R job requires that its input be several of these records from
the sequence file, with the set determined by the key. The sorting
mentioned above keeps those records packed together.
1. Any reason not to use a sequence file for this? Perhaps a mapfile?
Since I've sorted it, I don't need "random" access, but I do need
to be aware of the keys, as I need to be sure that all of the
relevant keys get sent to a given mapper (first sketch below).
2. It looks like I want a custom input format for this, extending
SequenceFileInputFormat. Do you agree? I'll gladly take some
opinions on this, as I ultimately want to split based on what's in
the file, which might be a little unorthodox (second sketch below).
3. Another idea might be to create a separate seq file for each chunk
of records and make them non-splittable, ensuring that each chunk
goes to a single mapper (third sketch below). Assuming I can get
away with this, do you see any pros/cons with that approach?
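
To make the questions more concrete, here are some rough, untested
sketches of what I have in mind.

For 1, the key awareness would come from a pre-scan that records where
each group of related keys starts. groupOf() below is only a stand-in
for however related keys are actually grouped (say, a shared prefix):

import java.io.IOException;
import java.util.LinkedHashMap;
import java.util.Map;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class KeyOffsets {
  // Scan the whole file once, recording the byte offset at which each
  // new group of related keys begins.
  public static Map<String, Long> groupStarts(FileSystem fs, Path file,
      Configuration conf) throws IOException {
    Map<String, Long> starts = new LinkedHashMap<String, Long>();
    SequenceFile.Reader reader = new SequenceFile.Reader(fs, file, conf);
    try {
      Text key = new Text();
      BytesWritable value = new BytesWritable();
      String current = null;
      long pos = reader.getPosition();           // offset of the next record
      while (reader.next(key, value)) {
        String group = groupOf(key.toString());  // stand-in, see note above
        if (!group.equals(current)) {
          starts.put(group, pos);
          current = group;
        }
        pos = reader.getPosition();
      }
    } finally {
      reader.close();
    }
    return starts;
  }

  // Stand-in: here a group is everything before the first '.' in the key.
  private static String groupOf(String key) {
    int dot = key.indexOf('.');
    return dot < 0 ? key : key.substring(0, dot);
  }
}

The thought is that these offsets could later be fed to whatever ends
up doing the splitting.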
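
For 2, the skeleton would extend the new-API SequenceFileInputFormat
and adjust the default splits. alignToGroupBoundary() is a placeholder
that would snap a split to the nearest group start, using an index
like the one above:

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;
import org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat;

public class KeyGroupInputFormat
    extends SequenceFileInputFormat<Text, BytesWritable> {

  @Override
  public List<InputSplit> getSplits(JobContext job) throws IOException {
    // Start from the default block-sized splits...
    List<InputSplit> defaults = super.getSplits(job);
    List<InputSplit> adjusted = new ArrayList<InputSplit>(defaults.size());
    for (InputSplit split : defaults) {
      // ...then nudge each boundary so it falls between key groups
      // rather than inside one.
      adjusted.add(alignToGroupBoundary((FileSplit) split,
          job.getConfiguration()));
    }
    return adjusted;
  }

  // Placeholder: would look up the nearest group-start offset (e.g. from
  // the index built above) and return a split snapped to that boundary.
  private FileSplit alignToGroupBoundary(FileSplit split, Configuration conf) {
    return split;
  }
}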
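
For 3, the non-splittable variant seems to be just an isSplitable()
override (again assuming the new mapreduce API):

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat;

public class WholeSeqFileInputFormat
    extends SequenceFileInputFormat<Text, BytesWritable> {
  // Never split a chunk file, so each one goes to exactly one mapper.
  @Override
  protected boolean isSplitable(JobContext context, Path file) {
    return false;
  }
}

The downside I can already see is that one oversized chunk file means
one long-running mapper, since it can't be split any further.
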
Thanks,
Tom