Hi,
The Hadoop Definitive Guide states that "if your input files are
compressed, they will be automatically decompressed as they are read
by MapReduce, using the filename extension to determine the codec to
use" (in the section titled "Using Compression in MapReduce"). I'm
trying to run a MapReduce job with some gzipped files as input and
this isn't working. Does support for this have to be built into the
input format? I'm using a custom one that extends FileInputFormat. Is
there an additional configuration option that should be set? I'd like
to avoid having to do decompression from within my map.

I'm using the new API and the CDH3b2 distro.

Thanks.


  • Tom White at Oct 8, 2010 at 9:34 pm
    It's done by the RecordReader. For text-based input formats, which use
    LineRecordReader, decompression is carried out automatically. For
    others it's not (e.g. sequence files, which have internal compression).
    So it depends on what your custom input format does.

    Cheers,
    Tom
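
    For reference, the extension-to-codec lookup described here is done by
    CompressionCodecFactory, which is what LineRecordReader relies on
    internally. A minimal sketch of that lookup (the input path below is
    just a placeholder):

        import org.apache.hadoop.conf.Configuration;
        import org.apache.hadoop.fs.Path;
        import org.apache.hadoop.io.compress.CompressionCodec;
        import org.apache.hadoop.io.compress.CompressionCodecFactory;

        public class CodecLookup {
            public static void main(String[] args) {
                Configuration conf = new Configuration();
                // The factory maps registered codec file extensions (.gz, .bz2, ...)
                // to the codec classes listed in io.compression.codecs.
                CompressionCodecFactory factory = new CompressionCodecFactory(conf);
                CompressionCodec codec = factory.getCodec(new Path("input/part-00000.gz"));
                // A .gz path resolves to GzipCodec; an uncompressed path returns null.
                System.out.println(codec == null ? "not compressed" : codec.getClass().getName());
            }
        }
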
  • Patrick Marchwiak at Oct 8, 2010 at 11:58 pm
    Thanks for the explanation. My input format uses its own RecordReader,
    so it looks like I'll have to add compression support to it myself.
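
    A minimal sketch of what that can look like, assuming a line-oriented
    reader built on the new API (the class name and the line-based parsing
    in nextKeyValue() are placeholders for the custom format's own logic):

        import java.io.BufferedReader;
        import java.io.IOException;
        import java.io.InputStream;
        import java.io.InputStreamReader;

        import org.apache.hadoop.conf.Configuration;
        import org.apache.hadoop.fs.FileSystem;
        import org.apache.hadoop.fs.Path;
        import org.apache.hadoop.io.LongWritable;
        import org.apache.hadoop.io.Text;
        import org.apache.hadoop.io.compress.CompressionCodec;
        import org.apache.hadoop.io.compress.CompressionCodecFactory;
        import org.apache.hadoop.mapreduce.InputSplit;
        import org.apache.hadoop.mapreduce.RecordReader;
        import org.apache.hadoop.mapreduce.TaskAttemptContext;
        import org.apache.hadoop.mapreduce.lib.input.FileSplit;

        public class CompressionAwareRecordReader extends RecordReader<LongWritable, Text> {

            private BufferedReader reader;
            private long recordNum;
            private final LongWritable key = new LongWritable();
            private final Text value = new Text();

            @Override
            public void initialize(InputSplit split, TaskAttemptContext context) throws IOException {
                Path path = ((FileSplit) split).getPath();
                Configuration conf = context.getConfiguration();
                FileSystem fs = path.getFileSystem(conf);

                InputStream in = fs.open(path);
                // Look up a codec from the filename extension and, if one is
                // found, wrap the raw stream so records are read decompressed.
                CompressionCodec codec = new CompressionCodecFactory(conf).getCodec(path);
                if (codec != null) {
                    in = codec.createInputStream(in);
                }
                reader = new BufferedReader(new InputStreamReader(in));
            }

            @Override
            public boolean nextKeyValue() throws IOException {
                String line = reader.readLine();
                if (line == null) {
                    return false;
                }
                key.set(recordNum++);
                value.set(line);
                return true;
            }

            @Override
            public LongWritable getCurrentKey() { return key; }

            @Override
            public Text getCurrentValue() { return value; }

            @Override
            public float getProgress() { return 0.0f; } // progress reporting omitted

            @Override
            public void close() throws IOException {
                if (reader != null) {
                    reader.close();
                }
            }
        }

    One caveat worth noting: gzip is not a splittable format, so a
    FileInputFormat that reads .gz inputs should also return false from
    isSplitable() for compressed paths, ensuring each file goes to a single
    mapper rather than being split mid-stream.
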
