• Eric Lubow at Feb 21, 2011 at 11:47 pm
I have been working my way through Pig recently with a lot of help from the
folks in #hadoop-pig on Freenode.

The problem I am having is with reading any gzip'd files, whether local or
from S3. This happens with Pig in local mode. I am using Pig 0.6 on an
Amazon EMR (Elastic MapReduce) instance. I have checked my core-site.xml and
it has the following line for compression codecs:

<property>
  <name>io.compression.codecs</name>
  <value>org.apache.hadoop.io.compress.GzipCodec,org.apache.hadoop.io.compress.DefaultCodec,com.hadoop.compression.lzo.LzoCodec,com.hadoop.compression.lzo.LzopCodec,org.apache.hadoop.io.compress.BZip2Codec</value>
</property>

Gzip is listed there so I don't know why it won't decode properly. I am
trying to do the following as a test:

--
Y = LOAD 's3://$bucket/$path/log.*.gz' AS (line:chararray);
foo = LIMIT Y 5;
dump foo
(?ks?F?6?)

Y = LOAD 'file:///home/hadoop/logs/test.log.gz' AS (line:chararray);
foo = LIMIT Y 5;
dump foo
(?ks?F?6?)
--

Both yield the same results. What I am actually trying to parse is
compressed JSON. Up to this point Dmitriy has helped me, and the JSON loads
and the scripts run perfectly as long as the logs are not compressed. Since
the logs are compressed, my hands are tied. Any suggestions to get me moving
in the right direction? Thanks.

-e
--
Eric Lubow
e: eric.lubow@gmail.com
w: eric.lubow.org


  • Charles Gonçalves at Feb 22, 2011 at 2:29 am
    I'm not sure if it is the same problem.

    I wrote a custom loader and ran into a problem reading compressed files
    too. Then I noticed that in PigStorage the getInputFormat function was:

    public InputFormat getInputFormat() throws IOException {
        if (loadLocation.endsWith(".bz2") || loadLocation.endsWith(".bz")) {
            return new Bzip2TextInputFormat();
        } else {
            return new PigTextInputFormat();
        }
    }

    And in my custom loader it was:

    public InputFormat getInputFormat() {
        return new TextInputFormat();
    }


    I just copied the code from PigStorage and everything worked.
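
    A minimal sketch of that approach as a complete loader, assuming the Pig
    0.7+ LoadFunc API (the class name is illustrative and the pass-through
    getNext is a placeholder; a real loader would parse its JSON there):

    import java.io.IOException;

    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.InputFormat;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.RecordReader;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.pig.LoadFunc;
    import org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigSplit;
    import org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigTextInputFormat;
    import org.apache.pig.bzip2r.Bzip2TextInputFormat;
    import org.apache.pig.data.Tuple;
    import org.apache.pig.data.TupleFactory;

    public class MyLineLoader extends LoadFunc {
        private String loadLocation;
        private RecordReader reader;
        private final TupleFactory tupleFactory = TupleFactory.getInstance();

        @Override
        public void setLocation(String location, Job job) throws IOException {
            // Remember the location so getInputFormat() can inspect the extension.
            loadLocation = location;
            FileInputFormat.setInputPaths(job, location);
        }

        @Override
        public InputFormat getInputFormat() throws IOException {
            // Same branching as PigStorage: bzip2 goes through Pig's own input
            // format, everything else (including .gz) through PigTextInputFormat,
            // which picks up Hadoop's configured compression codecs.
            if (loadLocation.endsWith(".bz2") || loadLocation.endsWith(".bz")) {
                return new Bzip2TextInputFormat();
            }
            return new PigTextInputFormat();
        }

        @Override
        public void prepareToRead(RecordReader reader, PigSplit split) throws IOException {
            this.reader = reader;
        }

        @Override
        public Tuple getNext() throws IOException {
            try {
                if (!reader.nextKeyValue()) {
                    return null;                        // end of input
                }
                Text line = (Text) reader.getCurrentValue();
                // Placeholder: wrap the raw line in a single-field tuple.
                return tupleFactory.newTuple(line.toString());
            } catch (InterruptedException e) {
                throw new IOException(e);
            }
        }
    }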


    --
    Charles Ferreira Gonçalves
    http://homepages.dcc.ufmg.br/~charles/
    UFMG - ICEx - Dcc
    Cel.: 55 31 87741485
    Tel.: 55 31 34741485
    Lab.: 55 31 34095840
  • Dmitriy Ryaboy at Feb 22, 2011 at 2:36 am
    He's on 0.6, so the interface is different. And for him even PigStorage
    doesn't decompress...

    It occurs to me that the problem may be with the underlying filesystem.
    Eric, what happens when you try reading out of a normal HDFS? (You can
    just run a pseudo-distributed cluster locally to test.)
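
    A related quick check (the paths are just examples): let Hadoop decode the
    file directly. If hadoop fs -text prints readable JSON, the data and the
    io.compression.codecs setting are fine and the problem is in how Pig reads
    the file:

    zcat /home/hadoop/logs/test.log.gz | head -n 2
    hadoop fs -text file:///home/hadoop/logs/test.log.gz | head -n 2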

    D
  • Eric Lubow at Feb 22, 2011 at 1:19 pm
    I'm not sure what you mean by testing it directly out of a normal HDFS. I
    have added the file to HDFS with 'hadoop fs -copyFromLocal', but then I
    can't access it via Pig using file:///. Am I doing something wrong, or are
    you asking me to try something else?

    -e
  • Eric Lubow at Feb 22, 2011 at 1:22 pm
    I apologize for the double mailing:

    grunt> Y = LOAD 'hdfs:///mnt/test.log.gz' AS (line:chararray);
    grunt> foo = LIMIT Y 5;
    grunt> dump foo
    <0\Mtest.log?]?o?H??}?)

    It didn't work out of HDFS.

    -e
  • Jacob Perkins at Feb 22, 2011 at 2:01 pm
    Here's what I just tried:

    I gzipped a file:

    'cat foo.tsv | gzip > foo.tsv.gz'

    Uploaded to my hdfs (hdfs://master:8020)

    'hadoop fs -put foo.tsv.gz /tmp'

    Then loaded it and dumped it with pig:

    grunt> data = LOAD 'hdfs://master/tmp/foo.tsv.gz';
    grunt> DUMP data;
    (98384,559)
    (98385,587)
    (98386,573)
    (98387,587)
    (98388,589)
    (98389,584)
    (98390,572)
    (98391,567)

    Looks great. I'm inclined to blame it on your version; I'm using Pig 0.8
    and Hadoop 0.20.2.

    --jacob
    @thedatachef

  • Eric Lubow at Feb 22, 2011 at 2:23 pm
    I think I figured out the problem. It occurred to me that I kept running
    pig -x local even when I was using things from S3. As soon as I dropped
    the "-x local" and tried pulling in a gzip file, it started to work. I'm
    not sure if this is intended behavior, but either way, problem solved:
    gzip doesn't get decompressed in Pig 0.6's local mode. Thanks all.
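
    To spell out the workaround (the script name is a placeholder):

    pig -x local myscript.pig       # Pig 0.6 local mode: .gz input comes through undecoded
    pig -x mapreduce myscript.pig   # MapReduce mode (the default): .gz input is decompressed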

    -e
