Our team is still new to Hadoop, and a colleague and I are trying to
decide on a file format. The arguments are:
* We should use a SequenceFile (binary) format as it's faster for the
machine to read than parsing text, and the files are smaller.
* We should use a text file format as it's easier for humans to read
and easier to change, and text files compress well. Also, a) if the
text format is designed well and b) given that Hadoop is a distributed
system where you can throw more nodes at a problem, the text parsing
time will wind up being negligible in the overall processing time.
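To make the trade-off a bit more concrete, here is a minimal sketch in
plain Python (not Hadoop code) that encodes the same records as
tab-separated text and as fixed-width binary, then compares raw and
compressed sizes. The record layout (an integer id and a float score),
the sample data, and all function names are made up for illustration,
and the binary layout is a simple stand-in, not the actual SequenceFile
format.

```python
import struct
import zlib

# Hypothetical records: (user_id, score). Invented sample data.
records = [(1000000 + i, i * 0.5) for i in range(1000)]

# Assumed binary layout: big-endian 4-byte int followed by 8-byte double.
RECORD = struct.Struct(">id")


def encode_text(rows):
    """One tab-separated line per record, UTF-8 encoded."""
    return "".join(f"{uid}\t{score}\n" for uid, score in rows).encode("utf-8")


def decode_text(data):
    """Parse each line back into (int, float); this is the text parsing cost."""
    rows = []
    for line in data.decode("utf-8").splitlines():
        uid, score = line.split("\t")
        rows.append((int(uid), float(score)))
    return rows


def encode_binary(rows):
    """Fixed-width records, no delimiters."""
    return b"".join(RECORD.pack(uid, score) for uid, score in rows)


def decode_binary(data):
    """Unpack fixed-width records; no string parsing involved."""
    return [RECORD.unpack_from(data, off)
            for off in range(0, len(data), RECORD.size)]


if __name__ == "__main__":
    text = encode_text(records)
    binary = encode_binary(records)
    print("text bytes:           ", len(text))
    print("binary bytes:         ", len(binary))
    print("text bytes (zlib):    ", len(zlib.compress(text)))
    print("binary bytes (zlib):  ", len(zlib.compress(binary)))
```

The numbers will depend heavily on the data (long numeric fields favor
binary, short ones less so), so this only shows how one might measure
the difference, not which format wins.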
I realize I'm leaving out a lot of variables and specifics that could
affect the answer, but I'm wondering whether the Hadoop community has
any general rules of thumb about this, like "favor (binary) sequence
files over text files" or some such.
If anyone has any general suggestions/advice here, please post back.
Thanks,
DR