FAQ
We have terabytes of XML data in .gz format, where each file is about 20 MB.
This dataset is not expected to change. My goal is to write a map-only job
that reads in one .gz file at a time and writes the result in .lzo format.
Since there are a large number of .gz files, the map parallelism is expected
to be maximized. I am using Kevin Weil's LZO distribution, and there does
not seem to be an LzoTextOutputFormat. When I got LZO to work before, I set
InputFormatClass to LzoTextInputFormat.class and the map output got LZO
compressed automatically. What does one configure for LZO output?

The current job configuration code, listed below, does not work. XmlInputFormat
is my custom input format for reading XML files.

job.setInputFormatClass(XmlInputFormat.class);
job.setMapperClass(XmlAnalyzer.XmlAnalyzerMapper.class);

job.setMapOutputKeyClass(Text.class);
job.setMapOutputValueClass(Text.class);

job.setOutputKeyClass(Text.class);
job.setOutputValueClass(Text.class);

String mapredOutputCompress = conf.get("mapred.output.compress");
if ("true".equals(mapredOutputCompress)) {
    // this reads input and writes output in lzo format
    job.setInputFormatClass(LzoTextInputFormat.class);
}


  • Ed at Sep 28, 2010 at 8:13 pm
    I've had luck doing the following in main (assuming LZO is set up properly).
    (I'm using Hadoop 0.20.2.)

    FileOutputFormat.setCompressOutput(job, true);
    FileOutputFormat.setOutputCompressorClass(job,
        com.hadoop.compression.lzo.LzopCodec.class);

    Make sure Kevin Weil's jar file is accessible when building your jar and is
    available on the cluster.
    You should see LZO being loaded at the beginning of each job you run.

    Something like:

    INFO lzo.GPLNativeCodeLoader: Loaded native gpl library
    INFO lzo.LzoCodec: Successfully loaded & initialized native-lzo library

    (You should see both lines; together they confirm that Hadoop sees your jar
    and native library.)

    Hope that works!

    ~Ed
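
Ed's check that both loader lines appear can be automated against captured job console output. The following is only a minimal sketch: the class name `LzoLoadCheck` and the marker substrings are hypothetical choices for illustration, taken from the sample log lines in this post rather than from any documented log format.

```java
// Sketch: scans captured job console output for the two LZO loader messages.
// The marker substrings are assumptions based on the sample log lines quoted
// in this thread; the exact wording may differ between hadoop-lzo versions.
final class LzoLoadCheck {
    // Expected when the native GPL library has been found and loaded.
    static final String NATIVE_GPL_MARKER = "Loaded native gpl library";
    // Expected when the codec has initialized the native LZO library.
    static final String NATIVE_LZO_MARKER = "initialized native-lzo library";

    /** Returns true only if both loader messages appear in the given output. */
    static boolean bothLibrariesLoaded(String jobOutput) {
        return jobOutput.contains(NATIVE_GPL_MARKER)
            && jobOutput.contains(NATIVE_LZO_MARKER);
    }

    public static void main(String[] args) {
        String sample =
            "INFO lzo.GPLNativeCodeLoader: Loaded native gpl library\n"
          + "INFO lzo.LzoCodec: Successfully loaded & initialized native-lzo library\n";
        System.out.println(bothLibrariesLoaded(sample));
    }
}
```

If only one of the two messages is present, the check returns false, which matches Ed's advice that both lines are needed.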
    (In reply to Steve Kuo's post of Tue, Sep 28, 2010 at 3:06 PM, shown at the top of this thread.)
  • Steve Kuo at Sep 28, 2010 at 8:02 pm
    Thanks, Ed. It works like a charm.
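
For reference, the two FileOutputFormat calls in Ed's reply correspond, as far as I know, to plain configuration properties in Hadoop 0.20.x, so output compression is independent of the input format and XmlInputFormat can stay as the input format. The sketch below only collects those property names and renders them as -D options; the class and method names are hypothetical, and -D options are only picked up if the driver parses generic options (for example via ToolRunner).

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Sketch: the configuration properties for a map-only job with LZO output.
// Property names are the Hadoop 0.20.x names (an assumption for other versions).
final class LzoOutputSettings {
    /** Properties for a map-only job whose output is compressed with LzopCodec. */
    static Map<String, String> mapOnlyLzoOutput() {
        Map<String, String> props = new LinkedHashMap<>();
        // Zero reducers makes the job map-only.
        props.put("mapred.reduce.tasks", "0");
        // Equivalent of FileOutputFormat.setCompressOutput(job, true).
        props.put("mapred.output.compress", "true");
        // Equivalent of FileOutputFormat.setOutputCompressorClass(job, LzopCodec.class).
        props.put("mapred.output.compression.codec",
                  "com.hadoop.compression.lzo.LzopCodec");
        return props;
    }

    /** Renders the properties as space-separated "-D key=value" options. */
    static String asGenericOptions(Map<String, String> props) {
        StringBuilder sb = new StringBuilder();
        for (Map.Entry<String, String> e : props.entrySet()) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append("-D ").append(e.getKey()).append('=').append(e.getValue());
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(asGenericOptions(mapOnlyLzoOutput()));
    }
}
```

The same key/value pairs could instead be copied into the job's Configuration with conf.set(key, value) before the Job is created.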
