Hello,

I am running a custom crawler (written internally) using Hadoop
streaming. I am attempting to compress the output using LZO, but
instead I am receiving corrupted output that is neither in the format
I am aiming for nor a compressed lzo file. Is this a known issue? Is
there anything I am doing inherently wrong?

Here is the command line I am using:

~/hadoop/bin/hadoop jar /home/hadoop/hadoop/contrib/streaming/hadoop-0.17.2.1-streaming.jar
  -inputformat org.apache.hadoop.mapred.SequenceFileAsTextInputFormat
  -mapper /home/hadoop/crawl_map
  -reducer NONE
  -jobconf mapred.output.compress=true
  -jobconf mapred.output.compression.codec=org.apache.hadoop.io.compress.LzoCodec
  -input pages
  -output crawl.lzo
  -jobconf mapred.reduce.tasks=0

The input is in the form of URLs stored as a SequenceFile.

When running this without LZO compression, no such issue occurs.

Is there any way for me to recover the corrupted data so that I can
process it with other Hadoop jobs or offline?

Thanks,

--
Alex Feinberg
Platform Engineer, SocialMedia Networks


  • Chris Douglas at Sep 19, 2008 at 9:36 am
    It's probably not corrupted. If by "compressed lzo file" you mean
    something readable with lzop, you should use LzopCodec, not LzoCodec.
    LzoCodec doesn't write header information required by that tool.

    Guessing at the output format (length-encoded blocks of data
    compressed by the lzo algorithm), it's probably readable by
    TextInputFormat, but YMMV. If you want to use the C tool, you'll
    have to add the appropriate header (see the lzop source or LzopCodec)
    with a hex editor and append four zero bytes to the end of the file.
    You can also use lzo compression in SequenceFiles. -C
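
    A minimal sketch of that readback, assuming the part files really are
    plain LzoCodec streams as described above: instead of expecting the
    lzop layout, open them through the codec from Java. This needs the
    native lzo library that LzoCodec wraps, and the part-file name is a
    guess based on the job above.

        import java.io.InputStream;
        import org.apache.hadoop.conf.Configuration;
        import org.apache.hadoop.fs.FileSystem;
        import org.apache.hadoop.fs.Path;
        import org.apache.hadoop.io.IOUtils;
        import org.apache.hadoop.io.compress.CompressionCodec;
        import org.apache.hadoop.io.compress.LzoCodec;
        import org.apache.hadoop.util.ReflectionUtils;

        public class LzoCat {
          public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(conf);
            // Instantiate the codec via ReflectionUtils so it picks up conf.
            CompressionCodec codec = ReflectionUtils.newInstance(LzoCodec.class, conf);
            // e.g. args[0] = "crawl.lzo/part-00000.lzo" (hypothetical name)
            InputStream in = codec.createInputStream(fs.open(new Path(args[0])));
            try {
              IOUtils.copyBytes(in, System.out, conf, false); // decompressed text to stdout
            } finally {
              in.close();
            }
          }
        }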
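
    And a deliberately experimental sketch of the hex-editor repair just
    described: graft a header taken from a real lzop file onto the Hadoop
    output, then terminate it with four zero bytes (lzop's end-of-stream
    marker). Whether lzop accepts the result depends on the header's flag
    bytes matching the block layout Hadoop actually wrote, so verify with
    lzop -t. The reference file ref.lzo (lzop run over an empty input) and
    the other file names are assumptions.

        import java.io.IOException;
        import java.io.OutputStream;
        import java.nio.file.Files;
        import java.nio.file.Paths;

        public class LzopGraft {
          public static void main(String[] args) throws IOException {
            // ref.lzo: a valid lzop header followed by lzop's own 4-byte end marker.
            byte[] ref = Files.readAllBytes(Paths.get("ref.lzo"));
            // The raw length-prefixed lzo blocks Hadoop's LzoCodec wrote.
            byte[] body = Files.readAllBytes(Paths.get("part-00000.lzo"));
            OutputStream out = Files.newOutputStream(Paths.get("repaired.lzo"));
            try {
              out.write(ref, 0, ref.length - 4); // header only: drop ref's end marker
              out.write(body);                   // compressed blocks
              out.write(new byte[4]);            // four zero bytes: end of stream
            } finally {
              out.close();
            }
          }
        }
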
  • Alex Feinberg at Sep 19, 2008 at 3:46 pm
    Hi Chris,

    I was also unable to decompress it by running a map-only job with
    "cat" as the mapper and then doing dfs -get.

    I will try using LzopCodec.

    Thanks,
    - Alex
  • Chris Douglas at Sep 23, 2008 at 1:10 am
    If you're using TextInputFormat, you need to add LzoCodec to the list
    of codecs in the io.compression.codecs property.
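
    A short sketch of that registration, assuming a codec list that lacks
    LzoCodec; normally the property lives in hadoop-site.xml, and setting
    it on a Configuration in code is equivalent. Once LzoCodec is listed,
    CompressionCodecFactory maps the ".lzo" suffix to it and
    TextInputFormat decompresses the part files transparently.

        import org.apache.hadoop.conf.Configuration;
        import org.apache.hadoop.fs.Path;
        import org.apache.hadoop.io.compress.CompressionCodecFactory;

        public class CodecCheck {
          public static void main(String[] args) {
            Configuration conf = new Configuration();
            conf.set("io.compression.codecs",
                "org.apache.hadoop.io.compress.DefaultCodec,"
                + "org.apache.hadoop.io.compress.GzipCodec,"
                + "org.apache.hadoop.io.compress.LzoCodec");
            CompressionCodecFactory factory = new CompressionCodecFactory(conf);
            // Prints the codec chosen for the suffix; LzoCodec once the
            // registration above is in effect. The path is hypothetical.
            System.out.println(factory.getCodec(new Path("crawl.lzo/part-00000.lzo")));
          }
        }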

    LzopCodec is only for reading/writing files produced/consumed by the
    C tool, and it isn't available in 0.17. The ".lzo" files produced in
    0.17 are not "real" .lzo files, but the suffix is how you get the
    codec to recognize them in this version. In the future, you might
    want to just use the lzo codec with SequenceFileOutputFormat (with
    BLOCK compression). -C
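
    A sketch of that alternative, with assumed Text key/value types and a
    made-up record: write lzo-compressed SequenceFiles with BLOCK
    compression, which later jobs read through SequenceFileInputFormat
    with no lzop-compatibility concerns. In streaming, the equivalent is
    -outputformat org.apache.hadoop.mapred.SequenceFileOutputFormat plus
    -jobconf mapred.output.compression.type=BLOCK.

        import org.apache.hadoop.conf.Configuration;
        import org.apache.hadoop.fs.FileSystem;
        import org.apache.hadoop.fs.Path;
        import org.apache.hadoop.io.SequenceFile;
        import org.apache.hadoop.io.Text;
        import org.apache.hadoop.io.compress.LzoCodec;
        import org.apache.hadoop.util.ReflectionUtils;

        public class LzoSeqWrite {
          public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(conf);
            // BLOCK compression batches many records per compressed block,
            // which is what makes lzo worthwhile here.
            SequenceFile.Writer out = SequenceFile.createWriter(
                fs, conf, new Path(args[0]), Text.class, Text.class,
                SequenceFile.CompressionType.BLOCK,
                ReflectionUtils.newInstance(LzoCodec.class, conf));
            try {
              out.append(new Text("http://example.com/"), new Text("fetched page body"));
            } finally {
              out.close();
            }
          }
        }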
