bzip/gzip (Pig user mailing list, March 2009)
I am considering starting to use compression for the data files I process
with Pig. I am using the trunk version of Pig on Hadoop 0.18.3.
Uncompressed files are about 500 MB each, and I plan to have a total of a
few dozen terabytes of uncompressed data.
The DFS block size I am using is 96 MB.

I am looking for feedback on the idea of using bzip2- or gzip-compressed
files. Can Pig handle them? How would it affect splitting?
I have read somewhere that bzip files can be split, whereas gzip files
cannot. Could somebody confirm this?

Thanks!

Vadim

  • Alan Gates at Mar 16, 2009 at 3:11 pm
    I haven't worked extensively with compressed data, so I'll let others
    who have share their experience. But pig does work with bzip data,
    which can be split. PigStorage checks to see if the input or output
    file ends in .bz, and if so uses bzip to read/write the data. There
    have been some bugs in this code, so you should make sure you have the
    top-of-trunk version, as it has been fixed fairly recently.

    gzip files cannot be split, and if you gzip your whole file, you can't
    really use it with map/reduce or pig. But hadoop now supports
    compressing each block. As I understand it, lzo is preferred over gzip
    for this. When you use this, it works fine with pig because hadoop
    handles the (de)compression underneath pig. You should be able to
    find info on how to do this on your cluster in the hadoop docs.

    Alan.
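
    A minimal sketch of what this looks like in a script (the paths and
    schema here are hypothetical): PigStorage picks the bzip handling
    purely from the .bz suffix on the input and output locations.

    -- input ends in .bz, so PigStorage reads it through the bzip codec
    logs = LOAD '/data/input/events.bz' USING PigStorage('\t')
           AS (ts:long, url:chararray);
    recent = FILTER logs BY ts > 1230768000L;
    -- output path also ends in .bz, so the result is written bzip-compressed
    STORE recent INTO '/data/output/recent-events.bz' USING PigStorage('\t');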
  • Tamir Kamara at Mar 16, 2009 at 6:10 pm
    Hi,

    I did some testing with both gzip and bzip2.
    As Alan wrote, bz has the advantage of being splittable out of the box,
    but the disadvantage is its performance, both in compression and
    decompression: bz is slow, and I don't think the smaller files are
    worth it.
    I also got wrong results when using bz files with the latest trunk,
    which suggests that there are still some problems. I emailed the
    details of the problem here a week ago.
    For now, when I need to, I split the files manually and gzip them
    before moving them into the DFS under a specific directory, and then
    load that entire directory with pig.
    I also tried to use lzo but had some problems with it. What I did see
    is that lzo is faster than gzip but produces larger files.
    As I understand the situation, pig can only write bz files, but it can
    also read gz, lzo and zlib (handled by hadoop).
    I originally wanted pig to write normal text files and have hadoop
    compress the output to the other compression types (e.g. lzo), and I
    configured hadoop as mentioned in the docs, but I still got
    uncompressed output. If anyone knows how to use this feature, please
    write.

    Tamir
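
    A rough sketch of the workflow described above (relation names are
    made up; the directory and schema are the ones used in the query later
    in this thread): the manually gzipped parts sit under one directory,
    the whole directory is loaded with a glob, and pig can still write its
    own output bzip-compressed via the .bz suffix.

    -- each file under links-gz/ was gzipped before being copied into the DFS;
    -- gz files are not splittable, so each file becomes one map task
    links = LOAD '/user/hadoop/links/links-gz/*' USING PigStorage('\t')
            AS (target:int, source:int);
    by_target = GROUP links BY target;
    counts = FOREACH by_target GENERATE group AS target, COUNT(links);
    -- a .bz output path makes PigStorage write bzip-compressed output
    STORE counts INTO '/user/hadoop/links/target-counts.bz';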

  • Mridul Muralidharan at Mar 16, 2009 at 7:24 pm

    Can you elaborate on what you mean by uncompressed output?

    IIRC, hadoop does block compression, so a fetch would give you the
    uncompressed file and a put compresses it on the fly; the actual data
    is stored in compressed form in the HDFS blocks.
    So effective disk space goes up, with a (de)compression time tradeoff.
    All manipulation of the file would give you an uncompressed view of
    it, IIRC.

    Please do let me know in case my understanding is wrong!

    Thanks,
    Mridul
  • Tamir Kamara at Mar 17, 2009 at 1:28 pm
    Hi,

    What do you mean by IIRC?

    I tried this again just now and it seems that hadoop does compress the
    job output (I guess I did something wrong before). I know this from
    the output view in the DFS web interface, which displays the part
    files as if they were compressed.
    By the way, how do I get the compressed output out of the DFS and
    decompress it?

    Thanks,
    Tamir


  • Tamir Kamara at Mar 18, 2009 at 11:49 am
    Hi,

    To be more specific, I tested lzo output compression with both pig and
    hadoop. An MR job written without pig produces part files with the lzo
    suffix, which means that hadoop did compress the output of the job.
    However, a similar job written in pig, using the same
    compression-related configuration (stated in the hadoop-site.xml I
    attached), does not produce compressed output. I also attached the job
    xml for the pig job, which shows that pig gets all the relevant
    compression properties that should produce lzo output.

    Am I doing something wrong?


  • Benjamin Reed at Mar 17, 2009 at 1:09 pm
    Can you give more information on the wrong results you are getting? It would be great if we could reproduce the problem.

    ben

  • Tamir Kamara at Mar 17, 2009 at 1:19 pm
    Sure. My query is simple enough:

    --links = LOAD '/user/hadoop/links/links.txt.bz2' AS (target:int, source:int);
    links = LOAD '/user/hadoop/links/links-gz/*' AS (target:int, source:int);
    a = filter links by target==98;
    a1 = foreach a generate source;
    b = JOIN links by source, a1 by source USING 'replicated';
    c = group b by links::source;
    d = foreach c generate group as source, COUNT(*);
    dump d;

    I used the same source file to create both the bz file and the split
    gz files. The right results were produced with the gz files, and the
    bz results were off by 1 or 2 for all records.

    Thanks,
    Tamir
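
    One way to narrow this down (just a sketch, reusing the two paths from
    the query above) is to compare the total record counts pig reads from
    the bz2 file and from the gz directory; if the totals differ, records
    are being dropped or duplicated when the bzip input is split, before
    the join even runs.

    links_bz = LOAD '/user/hadoop/links/links.txt.bz2' AS (target:int, source:int);
    links_gz = LOAD '/user/hadoop/links/links-gz/*' AS (target:int, source:int);
    all_bz = GROUP links_bz ALL;
    all_gz = GROUP links_gz ALL;
    -- total number of records seen from each encoding of the same source file
    cnt_bz = FOREACH all_bz GENERATE COUNT(links_bz);
    cnt_gz = FOREACH all_gz GENERATE COUNT(links_gz);
    dump cnt_bz;
    dump cnt_gz;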

  • Benjamin Reed at Mar 17, 2009 at 2:34 pm
    Is there a way to reproduce the dataset?
    thanx
    ben

  • Tamir Kamara at Mar 17, 2009 at 3:51 pm
    I have the data and I just verified that the problem described is
    still happening. Do you want me to try something else on it?

  • Vadim Zaliva at Mar 17, 2009 at 6:44 pm
    I am experimenting with bzip2-compressed data, and now my tasks get
    killed after getting a bunch of these:

    Task attempt_200903131720_0047_m_000270_0 failed to report status for 627 seconds. Killing!

    I suspect it is bzip-related, but I am not 100% sure yet.

    Vadim
  • Benjamin Reed at Mar 17, 2009 at 1:13 pm
    Just to clarify gzip a little bit: gzip files cannot be split, so if you have gzipped input, you will get a map task for each file. It will still work, but if you don't have at least as many files as you have map task slots, you will lose some parallelism.

    ben


Discussion Overview
group: user
categories: pig, hadoop
posted: Mar 14, '09 at 9:00a
active: Mar 18, '09 at 11:49a
posts: 12
users: 5
website: pig.apache.org
