FAQ
If I want to read values out of input files as binary data, is this
what BytesWritable is for?

I've successfully run my first task that uses a SequenceFile for
output. Are there any examples of SequenceFile usage out there? I'd
like to see the full range of what SequenceFile can do. What are the
trade-offs between record compression and block compression? What are
the limits on the key and value sizes? How do you use the per-file
metadata?

My intended use is to read files on a local filesystem into a
SequenceFile, with the value of each record being the contents of each
file. I hacked MultiFileWordCount to get the basic concept working...
but I'd appreciate any advice from the experts. In particular, what's
the most efficient way to read data from an
InputStreamReader/BufferedReader into a BytesWritable object?

Thanks,

John

  • Owen O'Malley at Sep 15, 2008 at 5:00 pm

    On Sep 14, 2008, at 7:15 PM, John Howland wrote:

    > If I want to read values out of input files as binary data, is this
    > what BytesWritable is for?

    Yes.
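
    As a minimal sketch (not from the original message), holding raw bytes
    in a BytesWritable looks like this; the accessor names follow later
    Hadoop releases (0.18 spelled them get()/getSize()):

        import java.util.Arrays;
        import org.apache.hadoop.io.BytesWritable;

        public class BytesWritableDemo {
          public static void main(String[] args) {
            byte[] raw = {0x01, 0x02, 0x03};
            BytesWritable value = new BytesWritable();
            value.set(raw, 0, raw.length);  // copies the bytes into the writable

            // Caution: the backing array can be longer than the valid data,
            // so always pair getBytes() with getLength().
            byte[] out = Arrays.copyOf(value.getBytes(), value.getLength());
            System.out.println(out.length + " bytes");
          }
        }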

    > I've successfully run my first task that uses a SequenceFile for
    > output. Are there any examples of SequenceFile usage out there? I'd
    > like to see the full range of what SequenceFile can do.

    If you want serious usage, I'd suggest pulling up Nutch. Distcp also
    uses sequence files as its input.

    You should also probably look at the TFile package that Hong is writing.

    https://issues.apache.org/jira/browse/HADOOP-3315

    Once it is ready, it will likely be exactly what you are looking for.

    > What are the trade-offs between record compression and block
    > compression?

    You pretty much always want block compression. The only place where
    record compression is OK is if your values are web pages or some other
    huge chunks of text.
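
    For illustration, a minimal sketch of asking for block compression when
    creating the writer (the file path and key/value classes are just
    invented examples):

        import java.io.IOException;
        import org.apache.hadoop.conf.Configuration;
        import org.apache.hadoop.fs.FileSystem;
        import org.apache.hadoop.fs.Path;
        import org.apache.hadoop.io.BytesWritable;
        import org.apache.hadoop.io.SequenceFile;
        import org.apache.hadoop.io.Text;

        public class BlockCompressedWriter {
          public static void main(String[] args) throws IOException {
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(conf);
            // BLOCK buffers many records and compresses them together,
            // which usually shrinks data far more than per-record compression.
            SequenceFile.Writer writer = SequenceFile.createWriter(
                fs, conf, new Path("demo.seq"),
                Text.class, BytesWritable.class,
                SequenceFile.CompressionType.BLOCK);
            writer.append(new Text("key"),
                          new BytesWritable(new byte[] {1, 2, 3}));
            writer.close();
          }
        }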

    > What are the limits on the key and value sizes?

    Large. I think I've seen keys and/or values of around 50-100 MB. They
    certainly can't be bigger than 1 GB. I believe the TFile limit on keys
    may be 64 KB.

    > How do you use the per-file metadata?

    It is just an application-specific string-to-string map in the header
    of the file.
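
    A hedged sketch of writing and reading that header map, using the
    createWriter overload that takes a SequenceFile.Metadata (the "source"
    key and file path are invented examples):

        import java.io.IOException;
        import org.apache.hadoop.conf.Configuration;
        import org.apache.hadoop.fs.FileSystem;
        import org.apache.hadoop.fs.Path;
        import org.apache.hadoop.io.BytesWritable;
        import org.apache.hadoop.io.SequenceFile;
        import org.apache.hadoop.io.Text;
        import org.apache.hadoop.io.compress.DefaultCodec;

        public class MetadataDemo {
          public static void main(String[] args) throws IOException {
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(conf);
            Path path = new Path("meta.seq");

            // Fill the string-to-string map stored in the file header.
            SequenceFile.Metadata meta = new SequenceFile.Metadata();
            meta.set(new Text("source"), new Text("local-import"));

            SequenceFile.Writer writer = SequenceFile.createWriter(
                fs, conf, path, Text.class, BytesWritable.class,
                SequenceFile.CompressionType.BLOCK, new DefaultCodec(),
                null /* progress */, meta);
            writer.close();

            // The map comes straight back out of the header on open.
            SequenceFile.Reader reader = new SequenceFile.Reader(fs, path, conf);
            Text source = reader.getMetadata().get(new Text("source"));
            System.out.println("source = " + source);
            reader.close();
          }
        }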

    > My intended use is to read files on a local filesystem into a
    > SequenceFile, with the value of each record being the contents of each
    > file. I hacked MultiFileWordCount to get the basic concept working...

    You should also look at the Hadoop archives.
    http://hadoop.apache.org/core/docs/r0.18.0/hadoop_archives.html

    > but I'd appreciate any advice from the experts. In particular, what's
    > the most efficient way to read data from an
    > InputStreamReader/BufferedReader into a BytesWritable object?

    The easiest way is the way you've done it. You probably want to use
    LZO compression too.
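
    One caveat worth a sketch (invented for illustration, not from the
    thread): for binary file contents, read from the raw InputStream rather
    than a Reader, since a Reader decodes bytes into characters and would
    corrupt binary data.

        import java.io.ByteArrayOutputStream;
        import java.io.FileInputStream;
        import java.io.IOException;
        import java.io.InputStream;
        import org.apache.hadoop.io.BytesWritable;

        public class StreamToBytesWritable {
          // Buffers an entire stream into memory, then hands the bytes to
          // a BytesWritable in one set() call.
          static BytesWritable readAll(InputStream in) throws IOException {
            ByteArrayOutputStream buf = new ByteArrayOutputStream();
            byte[] chunk = new byte[64 * 1024];
            int n;
            while ((n = in.read(chunk)) != -1) {
              buf.write(chunk, 0, n);
            }
            byte[] bytes = buf.toByteArray();
            BytesWritable value = new BytesWritable();
            value.set(bytes, 0, bytes.length);
            return value;
          }

          public static void main(String[] args) throws IOException {
            InputStream in = new FileInputStream(args[0]);
            try {
              System.out.println(readAll(in).getLength() + " bytes");
            } finally {
              in.close();
            }
          }
        }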

    -- Owen
