Custom SerDe Question
Hi,
I've written a SerDe and I'd like it to be able to handle compressed data (gzip). Hadoop detects and decompresses on the fly, so if you have a compressed data set and you don't need to perform any custom interpretation of it as you go, Hadoop and Hive will handle it. Is there a way to get Hive to notice the data is compressed, decompress it, then push it through the custom SerDe? Or will I have to either
a. add some decompression logic to my SerDe (possibly impossible)
b. decompress the data before pushing it into a table with my SerDe

Thanks!

Pat


  • Phil young at Jan 28, 2011 at 9:52 pm
    This can be accomplished with a custom input format.

    Here's a snippet of the relevant code in the custom RecordReader:

    // Inside the RecordReader's initialization: detect a codec from the
    // file name and wrap the raw stream when one is found.
    compressionCodecs = new CompressionCodecFactory(jobConf);
    Path file = split.getPath();
    final CompressionCodec codec = compressionCodecs.getCodec(file);

    // Open the file and seek to the start of the split. (Gzip is not
    // splittable, so a gzipped file arrives as a single split.)
    start = split.getStart();
    end = start + split.getLength();
    pos = 0;

    FileSystem fs = file.getFileSystem(jobConf);
    fsdat = fs.open(file);
    fsdat.seek(start);

    // Compressed input: read through the codec's decompressing stream;
    // otherwise read the raw stream directly.
    if (codec != null) {
        fsin = codec.createInputStream(fsdat);
    } else {
        fsin = fsdat;
    }





  • Phil young at Jan 28, 2011 at 9:54 pm
    To be clear, you would then create the table with the clause:

    STORED AS
    INPUTFORMAT 'your.custom.input.format'


    If you make an external table, you'll then be able to point it at a
    directory (or file) that contains gzipped or uncompressed files.
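
    For instance, a minimal sketch of the full DDL (the table name, column,
    and location are illustrative, and 'your.custom.input.format' is the
    placeholder from above; Hive's DDL expects an OUTPUTFORMAT alongside
    INPUTFORMAT, so the stock text output format is used here):

    -- Sketch only: table name, column, and LOCATION are illustrative.
    CREATE EXTERNAL TABLE gz_logs (line STRING)
    STORED AS
      INPUTFORMAT 'your.custom.input.format'
      OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
    LOCATION '/data/gz_logs';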


  • Christopher, Pat at Jan 28, 2011 at 10:36 pm
    Not sure what I did wrong the first time, but I tried creating the table stored as a text file and using my custom SerDe, so the DDL included:

    ROW FORMAT SERDE 'org.myorg.hadoop.hive.udf.MySerDe' STORED AS TEXTFILE

    Then I loaded a gzipped file using LOAD DATA LOCAL INPATH 'path.gz' INTO TABLE mytable and it worked as expected, i.e., the file was read and I'm able to query it using Hive.
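
    Spelled out, the working sequence looks roughly like this (the single
    STRING column and the path are illustrative placeholders; the SerDe
    class and table name are the ones above):

    -- Sketch of the sequence described above; the column list and the
    -- input path are illustrative placeholders.
    CREATE TABLE mytable (line STRING)
    ROW FORMAT SERDE 'org.myorg.hadoop.hive.udf.MySerDe'
    STORED AS TEXTFILE;

    -- Hive/Hadoop recognize the .gz extension and decompress on read.
    LOAD DATA LOCAL INPATH 'path.gz' INTO TABLE mytable;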

    Sorry to bother you, and thanks a bunch for the help! Forcing me to go read more about InputFormats is a long-term help anyway.

    Pat

  • Phil young at Jan 28, 2011 at 11:00 pm
    Ahh, not as custom as I expected... that makes sense now.

    Glad things are working for you.

    -Phil



