Grokbase Groups Hive user March 2011
Building Custom RCFiles
Hi,

I am working on building an MR job that generates RCFiles that will become partitions of a Hive table. I have most of it working; however, only strings (Text) are being deserialized inside of Hive. The Hive table is specified to use a ColumnarSerDe, which I thought should allow the Writable types stored in the RCFile to be deserialized properly.

Currently all numeric types (IntWritable and LongWritable) come back as null.

Has anyone else seen anything like this or have any ideas? I would rather not convert all my data to strings to use RCFile.

Thanks.

Steve


  • Yongqiang he at Mar 17, 2011 at 11:05 pm
    You need to customize the serialize and deserialize functions of Hive's
    ColumnarSerDe (and maybe functions in LazySerde), depending on whether you
    want to read or write. The main thing is that you need to use your own
    type definitions (not LazyInt/LazyLong).

    If your type is int or long (not double/float), casting it to string
    only costs some CPU, and it can save you space.

    Thanks
    Yongqiang
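
The nulls can be reproduced outside Hive. The following is a minimal sketch in Python, not Hive code. It relies on two points: an IntWritable serializes an int as four big-endian bytes, while a text-based lazy integer expects decimal digits and yields null when parsing fails. `parse_lazy_int` is a hypothetical stand-in for that behaviour, written only for illustration.

```python
import struct


def int_writable_bytes(value: int) -> bytes:
    """Bytes an IntWritable writes: a 4-byte big-endian signed int."""
    return struct.pack(">i", value)


def text_bytes(value: int) -> bytes:
    """Bytes of the same value written as decimal text (UTF-8)."""
    return str(value).encode("utf-8")


def parse_lazy_int(raw: bytes):
    """Hypothetical stand-in for a text-based lazy integer.

    Parses decimal digits and returns None (standing in for NULL)
    when the bytes are not a valid decimal number.
    """
    try:
        text = raw.decode("utf-8")
    except UnicodeDecodeError:
        return None
    # Accept only an optional sign followed by ASCII digits.
    digits = text[1:] if text[:1] in ("-", "+") else text
    if not digits or not digits.isascii() or not digits.isdigit():
        return None
    return int(text)


binary_result = parse_lazy_int(int_writable_bytes(42))  # input is b"\x00\x00\x00*"
text_result = parse_lazy_int(text_bytes(42))            # input is b"42"
print(binary_result, text_result)
```

Under these assumptions the binary form is rejected and the text form parses, which matches the symptom described in the question: Text columns read back correctly while IntWritable and LongWritable columns come back as null.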
  • Yongqiang he at Mar 17, 2011 at 11:35 pm
    A side note: in Hive, all columns are saved as Text internally
    (even if the column's type is int, double, etc.). In some
    experiments, strings were friendlier to compression, but they need CPU
    to decode back to the original type.

    Thanks
    Yongqiang
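
To make the space versus CPU trade-off concrete, here is a rough sketch in Python rather than Hive code. It assumes a hypothetical column holding the integers 0 to 999 and compares decimal text against fixed 4-byte and 8-byte binary encodings (the widths of IntWritable and LongWritable). It ignores RCFile's per-field length metadata, and the compressed sizes depend heavily on the data, so the printed numbers are illustrative only.

```python
import struct
import zlib

# Hypothetical column: small non-negative integers.
values = list(range(1000))

# Decimal text, one field per value (field lengths would be stored separately).
text_column = b"".join(str(v).encode("ascii") for v in values)
# Fixed-width big-endian binary, as IntWritable / LongWritable would write.
int_column = b"".join(struct.pack(">i", v) for v in values)
long_column = b"".join(struct.pack(">q", v) for v in values)

for name, column in [("text", text_column), ("int", int_column), ("long", long_column)]:
    compressed = zlib.compress(column)
    print(f"{name:>4}: raw={len(column)} bytes, zlib={len(compressed)} bytes")

# Reading a value back from text costs a parse, which is the CPU price
# of storing numbers as strings.
decoded = int(b"123".decode("ascii"))
```

For small values the text form is shorter before compression (2890 bytes against 4000 and 8000 here). For doubles the text form can be longer than 8 bytes, which is presumably why the advice in this thread is limited to int and long.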
  • Severance, Steve at Mar 17, 2011 at 11:51 pm
    Thanks Yongqiang.

    So for more complex types like map, do I just set up a

    ROW FORMAT DELIMITED KEYS TERMINATED BY '|' etc...

    Thanks.

    Steve

  • Yongqiang he at Mar 18, 2011 at 12:09 am
    Yes. It is the same as with normal Hive tables.

    thanks
    yongqiang
  • Severance, Steve at Mar 18, 2011 at 10:30 pm
    One more question. I have everything working except a Map<String,String>.

    I understand that the whole Map will be physically stored as a single Text object in the RCFile.

    I have had considerable trouble setting up the delimiters for this Map.

    I want to have
    MAP KEYS TERMINATED BY '='
    COLLECTION ITEMS TERMINATED BY '&'

    Hive doesn't seem to want to take that. I have also tried using the ASCII octal codes.

    What do I need to set up to make this Map work?

    Thanks.

    Steve

  • Yongqiang he at Mar 18, 2011 at 10:50 pm
    What's your table definition?

    http://wiki.apache.org/hadoop/Hive/LanguageManual/DDL#Create_Table

    See ROW FORMAT


    Thanks
    Yongqiang
  • Severance, Steve at Mar 19, 2011 at 4:02 am
    Got it working using the ColumnarSerDe with the default separators.

    Steve
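
For anyone generating the map column from a MapReduce job, the following Python sketch shows what a Map<String,String> might look like as a single text value under default separators. It assumes the defaults are Ctrl-B (\002) between map entries and Ctrl-C (\003) between a key and its value for a top-level map column; verify this against your Hive version. The function names are hypothetical and no escaping is handled.

```python
# Assumed Hive defaults for a top-level map column (verify for your version).
COLLECTION_ITEM_SEP = "\x02"  # Ctrl-B, between map entries
MAP_KEY_SEP = "\x03"          # Ctrl-C, between a key and its value


def encode_map(entries: dict, item_sep: str = COLLECTION_ITEM_SEP,
               key_sep: str = MAP_KEY_SEP) -> str:
    """Join a dict into the single text value stored in the column."""
    for key, value in entries.items():
        if any(sep in key or sep in value for sep in (item_sep, key_sep)):
            raise ValueError("separator found in data; escaping is not handled here")
    return item_sep.join(f"{k}{key_sep}{v}" for k, v in entries.items())


def decode_map(text: str, item_sep: str = COLLECTION_ITEM_SEP,
               key_sep: str = MAP_KEY_SEP) -> dict:
    """Split the text value back into a dict (empty text gives an empty dict)."""
    if not text:
        return {}
    return dict(item.split(key_sep, 1) for item in text.split(item_sep))


encoded = encode_map({"user": "steve", "lang": "en"})
assert decode_map(encoded) == {"user": "steve", "lang": "en"}
```

Passing `item_sep="&"` and `key_sep="="` gives the query-string style layout asked about earlier in the thread, provided the table's ROW FORMAT declares the same delimiters.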


Discussion Overview
group: user @
categories: hive, hadoop
posted: Mar 17, '11 at 10:45p
active: Mar 19, '11 at 4:02a
posts: 8
users: 2
website: hive.apache.org

2 users in discussion

Yongqiang he: 4 posts; Severance, Steve: 4 posts
