On Sep 14, 2008, at 7:15 PM, John Howland wrote:
If I want to read values out of input files as binary data, is this
what BytesWritable is for?

Yes.
I've successfully run my first task that uses a SequenceFile for
output. Are there any examples of SequenceFile usage out there? I'd
like to see the full range of what SequenceFile can do.
If you want serious usage, I'd suggest pulling up Nutch. Distcp also
uses sequence files as its input.
You should also probably look at the TFile package that Hong is writing:
https://issues.apache.org/jira/browse/HADOOP-3315
Once it is ready, it will likely be exactly what you are looking for.
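As a starting point, here is a minimal sketch (not from the thread) of writing and then reading a SequenceFile with BytesWritable values and block compression. It assumes the Hadoop jars are on the classpath and uses the 0.18-era org.apache.hadoop.io API; the path and key names are made up for illustration.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class SeqFileDemo {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.getLocal(conf);
    Path path = new Path("/tmp/demo.seq");  // illustrative path

    // Write: key is a Text filename, value is the raw bytes.
    SequenceFile.Writer writer = SequenceFile.createWriter(
        fs, conf, path, Text.class, BytesWritable.class,
        SequenceFile.CompressionType.BLOCK);
    byte[] data = "hello".getBytes();
    writer.append(new Text("file-1"), new BytesWritable(data));
    writer.close();

    // Read the records back in order.
    SequenceFile.Reader reader = new SequenceFile.Reader(fs, path, conf);
    Text key = new Text();
    BytesWritable value = new BytesWritable();
    while (reader.next(key, value)) {
      System.out.println(key + " -> " + value.getSize() + " bytes");
    }
    reader.close();
  }
}
```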
What are the
trade-offs between record compression and block compression?
You pretty much always want block compression. The only case where
record compression is OK is when each value is a web page or some other
huge chunk of text.
What are the limits on the key and value sizes?
Large. I think I've seen keys and/or values of around 50-100 MB. They
certainly can't be bigger than 1 GB. I believe the TFile limit on keys
may be 64 KB.
How do you use the per-file metadata?
It is just an application-specific string-to-string map in the header
of the file.
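For concreteness, one way to set and read that header map is via SequenceFile.Metadata, as in this sketch (again assuming the Hadoop jars are available; the key/value strings are illustrative):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.compress.DefaultCodec;

public class SeqFileMetadataDemo {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.getLocal(conf);
    Path path = new Path("/tmp/meta.seq");  // illustrative path

    // Populate the header map before creating the writer; it is fixed
    // once the file has been written.
    SequenceFile.Metadata meta = new SequenceFile.Metadata();
    meta.set(new Text("source"), new Text("local-import"));

    SequenceFile.Writer writer = SequenceFile.createWriter(
        fs, conf, path, Text.class, BytesWritable.class,
        SequenceFile.CompressionType.BLOCK, new DefaultCodec(),
        null, meta);
    writer.append(new Text("k"), new BytesWritable(new byte[] {1}));
    writer.close();

    // The map comes straight out of the file header; no record scan.
    SequenceFile.Reader reader = new SequenceFile.Reader(fs, path, conf);
    System.out.println(reader.getMetadata().get(new Text("source")));
    reader.close();
  }
}
```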
My intended use is to read files on a local filesystem into a
SequenceFile, with the value of each record being the contents of each
file. I hacked MultiFileWordCount to get the basic concept working...
You should also look at the Hadoop archives:
http://hadoop.apache.org/core/docs/r0.18.0/hadoop_archives.html
but I'd appreciate any advice from the experts. In particular, what's
the most efficient way to read data from an
InputStreamReader/BufferedReader into a BytesWritable object?
The easiest way is the way you've done it. You probably want to use
LZO compression too.
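A stdlib-only sketch of buffering a local file into a byte array is below. One note: for binary data you want to read the raw InputStream rather than go through an InputStreamReader/BufferedReader, since those apply charset decoding that can corrupt arbitrary bytes. The BytesWritable hand-off is shown only as a comment so the sketch has no Hadoop dependency.

```java
import java.io.ByteArrayOutputStream;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;

public class FileBytes {
  // Read the whole file into memory through a fixed-size buffer.
  public static byte[] readAll(String path) throws IOException {
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    InputStream in = new FileInputStream(path);
    try {
      byte[] buf = new byte[64 * 1024];
      int n;
      while ((n = in.read(buf)) != -1) {
        out.write(buf, 0, n);
      }
    } finally {
      in.close();
    }
    // With Hadoop on the classpath, the result would be handed to a
    // BytesWritable, e.g.: value.set(bytes, 0, bytes.length);
    return out.toByteArray();
  }
}
```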