Hi,

My specific question is: is it possible to control how LZO files are
split by customizing the LZO index files?

The background of the problem is:

I have a file which has the following format

key1 value1
key1 value2
key2 value3
key2 value4
...

Its size in plain text before compression is 11 M. After LZO
compression, the size is 681 K. I tried two formats, plain text and
SequenceFile with block compression, and the compressed sizes are
almost the same.

However, when I join the same keys together and reformat the file as

key1 value1 value2
key2 value3 value4
...

The size before compression is of course more or less the same, 11 M. But
after LZO compression, the size is 4.8 M. My guess is that the LZO
compression algorithm can compress a lot of similar values in the
first format, whereas in the second format the concatenation of
multiple values is less likely to repeat exactly, so the compression
ratio gets worse.

So, again, my question is: if I would like to keep the file in the first
format, I need to prevent the file from being split in the middle of a key.
For example, all "key1" records should go to the same mapper. Is that
doable with an LZO file? Since the split behavior of LZO files relies on
the index files, is there any way to control the split by customizing
the LZO index files?

BTW, when using the second format, I found that bzip2 gives a better
compression ratio than LZO (2.1 M versus 4.8 M). Did I make a mistake
when using LZO compression?

Thanks!

Best Regards,

Shi
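
The size comparison described above can be reproduced on a small scale. The following is only a measurement sketch under stated assumptions: the Python standard library has no LZO binding, so zlib stands in for LZO here, and the data is synthetic (the helper names build_layouts and compressed_sizes are made up for this illustration). It shows how to measure both layouts, not what the numbers will be for real data.

```python
import bz2
import random
import zlib


def build_layouts(num_keys=200, values_per_key=20, seed=0):
    """Return the same synthetic records in both layouts, as bytes."""
    rng = random.Random(seed)
    one_per_line, grouped = [], []
    for k in range(num_keys):
        key = f"key{k}"
        values = [f"value{rng.randrange(1000)}" for _ in range(values_per_key)]
        # First layout: one "key value" pair per line.
        one_per_line.extend(f"{key} {v}" for v in values)
        # Second layout: the key once, followed by all of its values.
        grouped.append(key + " " + " ".join(values))
    return "\n".join(one_per_line).encode(), "\n".join(grouped).encode()


def compressed_sizes(data):
    """Sizes in bytes; zlib is an assumed stand-in for LZO, not LZO itself."""
    return {
        "raw": len(data),
        "zlib": len(zlib.compress(data)),
        "bz2": len(bz2.compress(data)),
    }


if __name__ == "__main__":
    first, second = build_layouts()
    print("one value per line:   ", compressed_sizes(first))
    print("values joined per key:", compressed_sizes(second))
```

To say anything about the real file, the same measurement would have to be run on a sample of the actual data with the actual codec.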

  • Dmitriy Ryaboy at Jun 23, 2011 at 9:36 pm
    Shi,
    bzip2 compresses much better than LZO. It is also significantly more
    expensive (we are talking orders of magnitude), both on compression
    and decompression.

    As for your question regarding custom splits -- LzoIndex does not support
    this kind of logic, as it's written to be generic and doesn't know how to
    read individual records, but you can certainly customize it to fit your use
    case.

    D


  • Shi Yu at Jun 23, 2011 at 9:52 pm
    Thanks Dmitriy!

    Not sure how much work it will be. I guess I should customize the
    InputFormat class in this case, right?

    Shi
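
One way to picture a reader-level customization is the rule that line-oriented readers commonly apply to lines, applied to key groups instead: a split that does not start at the beginning of the file skips the key group it lands in, and every split reads past its end to finish the group found at its boundary. The sketch below models only that rule, under stated assumptions: records are in memory, record indices stand in for byte offsets, equal keys are adjacent, and read_split is a hypothetical helper, not part of Hadoop or hadoop-lzo.

```python
def read_split(records, start, end):
    """Return the records a key-aware reader would emit for split [start, end).

    records: list of (key, value) tuples, with equal keys adjacent.
    start, end: record indices standing in for byte offsets (an assumption).
    """
    n = len(records)
    i = start
    # A non-first split skips the key group it lands in; the previous
    # split is responsible for finishing that group.
    if 0 < start < n:
        skip_key = records[start][0]
        while i < n and records[i][0] == skip_key:
            i += 1
    out = []
    # Read everything that lies inside the split proper.
    while i < min(end, n):
        out.append(records[i])
        i += 1
    # Read past the end to finish the group found at the boundary,
    # which the next split will skip.
    if end < n:
        boundary_key = records[end][0]
        while i < n and records[i][0] == boundary_key:
            out.append(records[i])
            i += 1
    return out


if __name__ == "__main__":
    data = [("key1", "value1"), ("key1", "value2"),
            ("key2", "value3"), ("key2", "value4"), ("key3", "value5")]
    for start, end in [(0, 1), (1, 3), (3, 5)]:
        print((start, end), read_split(data, start, end))
```

With this rule the split boundaries can fall anywhere, and each key group still reaches exactly one reader. In Hadoop the equivalent logic would sit in a custom RecordReader returned by a custom InputFormat; that wiring is not shown here.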

  • Bharath Mundlapudi at Jun 24, 2011 at 7:01 am
    "BTW, when using the second format, I found that bzip2 has better compression rate than Lzo (2.1 M). Did I make any mistake when using Lzo compression?"
    It depends on your requirements, for example whether you prefer a high compression ratio over performance. bzip2 is orders of magnitude slower than LZO.

    -Bharath

Discussion Overview
group: common-user @
categories: hadoop
posted: Jun 23, '11 at 8:59p
active: Jun 24, '11 at 7:01a
posts: 4
users: 3
website: hadoop.apache.org...
irc: #hadoop
