FAQ
Hi guys,

We have a lot of data stored in compressed SEQ files. Since a SEQ file is
a sequence of (key, value) pairs, we store one set of columns joined by
tabs in the key and another set of columns, also joined by tabs, in the
value. So our SEQ files are of type (Text, Text).
Hive's defaults do not handle such files the way I need. It ignores the
key of each record, although it deserializes the value into a set of
columns successfully.
Can someone please point me to how to get Hive not to ignore the SEQ key?
Thanks.

--
Andrey Pankov


  • Bobby Rullo at Nov 5, 2009 at 4:57 pm
    I had the exact same question, and Zheng told me I had to implement a
    new FileInputFormat, so I extended SequenceFileInputFormat, and it
    worked out pretty well.

    If you like, I can post the source code somewhere (here?), but it was
    pretty easy.

    Bobby
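The approach Bobby describes comes down to folding the key into the row before Hive sees it. Below is a minimal sketch of that merging step in plain Java, with no Hadoop dependencies so it can be run on its own. It is not the code from the pastebin link in this thread. The class and method names (`KeyValueMerger`, `mergeRecord`, `columns`) and the sample column values are hypothetical. In a real implementation this logic would presumably sit in the record reader returned by a subclass of SequenceFileInputFormat, which is an assumption about where it belongs rather than something the thread spells out.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Objects;

// Hypothetical helper, not taken from the pastebin linked in this thread.
// It models what a custom record reader could do with each (key, value)
// pair before handing the row on for deserialization.
final class KeyValueMerger {
    private static final String DELIMITER = "\t";

    private KeyValueMerger() {}

    // Joins key and value into one tab-delimited row: key columns first,
    // then value columns. An empty key still contributes one empty column,
    // so the column count stays stable across records.
    static String mergeRecord(String key, String value) {
        Objects.requireNonNull(key, "key");
        Objects.requireNonNull(value, "value");
        return key + DELIMITER + value;
    }

    // Splits a merged row into columns; the -1 limit keeps trailing
    // empty columns instead of silently dropping them.
    static List<String> columns(String row) {
        return new ArrayList<>(Arrays.asList(row.split(DELIMITER, -1)));
    }

    public static void main(String[] args) {
        String key = "user42\t2009-11-05"; // made-up columns stored in the SEQ key
        String value = "click\t/home";     // made-up columns stored in the SEQ value
        String row = mergeRecord(key, value);
        System.out.println(columns(row));  // prints [user42, 2009-11-05, click, /home]
    }
}
```

With rows merged this way, the Hive table would declare the key columns first, followed by the value columns, and would name the custom input format class in its INPUTFORMAT clause.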
  • Andrey Pankov at Nov 5, 2009 at 5:00 pm
    Thanks Bobby. Yes, it would be nice to take a look at your class, just
    to get familiar with it. Could you please post it at pastebin.com? Thanks
    a lot!
  • Bobby Rullo at Nov 5, 2009 at 6:55 pm
    Andrey,

    Here you go:

    http://pastebin.com/m5724ce8a

    Bobby
  • Zheng Shao at Nov 5, 2009 at 8:52 pm
    Hi Bobby,

    Can you open a JIRA and attach a patch?
    We can put that in contrib.

    Zheng

  • Bobby Rullo at Nov 5, 2009 at 10:19 pm
    Zheng,

    Sure, but it is pretty hacky!

    Bobby
  • Andrey Pankov at Nov 6, 2009 at 2:51 pm
    Thanks Bobby, you saved me time.

Discussion Overview
group: user @
categories: hive, hadoop
posted: Nov 5, '09 at 4:20p
active: Nov 6, '09 at 2:51p
posts: 7
users: 3
website: hive.apache.org
