Grokbase Groups Hive user July 2011
We have database dumps with TAB delimiters. Fields that contain a TAB have it escaped in the dumped text file, but Hive does not respect the escaped delimiters, so
create external table scratch.delete_me (a int, b int, c bigint, d string, e int) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' STORED AS TEXTFILE LOCATION '/tmp/users';


creates a table in which e is NULL for some rows.

Hive also does not allow multi-character delimiters in the ROW FORMAT DELIMITED spec.

What is the cleanest way to get past this problem? Options are:
1. Write custom SerDe class
2. Use RegexSerde
3. Remove escaped delimiter chars from data

I need to know the roadblocks before I invest time in any one of them.

-Ayon
See My Photos on Flickr
Also check out my Blog for answers to commonly asked questions.

  • Edward Capriolo at Jul 26, 2011 at 9:38 pm

    On Tue, Jul 26, 2011 at 5:13 PM, Ayon Sinha wrote:

    What is the cleanest way to get past this problem? Options are:
    1. Write custom SerDe class
    2. Use RegexSerde
    3. Remove escaped delimiter chars from data

    Yes.
    1) Not a bad solution, but when you start having to write custom
    InputFormats and SerDes for each table, you get annoyed.
    2) RegexSerDe has poor performance compared with the delimited format
    because of the complexity of regex matching.
    3) This is how I would do it. Sure, it means changes upstream, but hey, that is
    not your problem. You're the Hive guy :)
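
    For option 2, a RegexSerDe table might look roughly like the sketch below. It
    assumes the contrib RegexSerDe (which needs the hive-contrib jar and only
    accepts STRING columns), that only column d can contain escaped TABs, and a
    hypothetical table name delete_me_regex. The jar path is a placeholder.

    ```sql
    -- Placeholder path: point this at the hive-contrib jar of your installation.
    ADD JAR /path/to/hive-contrib.jar;

    -- Hypothetical table over the same files; the contrib RegexSerDe requires
    -- every column to be STRING, so casts happen at query time.
    CREATE EXTERNAL TABLE scratch.delete_me_regex (
      a STRING, b STRING, c STRING, d STRING, e STRING
    )
    ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe'
    WITH SERDEPROPERTIES (
      -- a, b, c and e contain no TABs; the greedy (.*) for d absorbs any
      -- escaped TABs (assumption: they occur only in d).
      "input.regex" = "([^\\t]*)\\t([^\\t]*)\\t([^\\t]*)\\t(.*)\\t([^\\t]*)"
    )
    STORED AS TEXTFILE
    LOCATION '/tmp/users';

    -- Cast back to the intended types when reading.
    SELECT CAST(a AS INT), CAST(b AS INT), CAST(c AS BIGINT), d, CAST(e AS INT)
    FROM scratch.delete_me_regex;
    ```

    Note that d still carries the escape character in front of each embedded TAB,
    and every line is matched against the regex, which is where the performance
    cost comes from.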
  • Ayon Sinha at Jul 26, 2011 at 9:52 pm
    I'm confused by this:
    https://issues.cloudera.org/browse/SQOOP-111?page=com.atlassian.jira.plugin.system.issuetabpanels%3Aall-tabpanel


    So does Hive actually support the escape characters?
    I am testing it as we speak.
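
    For reference, the ROW FORMAT DELIMITED grammar does accept an optional
    ESCAPED BY clause after FIELDS TERMINATED BY. A minimal sketch, assuming the
    dump escapes embedded TABs with a backslash (the thread does not say which
    escape character the dump uses):

    ```sql
    -- Same table as above, with an escape character declared.
    -- Assumption: embedded TABs in the dump are preceded by a backslash.
    CREATE EXTERNAL TABLE scratch.delete_me (
      a INT, b INT, c BIGINT, d STRING, e INT
    )
    ROW FORMAT DELIMITED
      FIELDS TERMINATED BY '\t'
      ESCAPED BY '\\'
    STORED AS TEXTFILE
    LOCATION '/tmp/users';
    ```

    Whether this removes the NULLs in e depends on the Hive version and on how
    the dump actually escapes the TAB, so it is worth checking against a few of
    the affected rows.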

    -Ayon

    ________________________________
    From: Edward Capriolo <edlinuxguru@gmail.com>
    To: user@hive.apache.org; Ayon Sinha <ayonsinha@yahoo.com>
    Sent: Tuesday, July 26, 2011 2:38 PM
    Subject: Re: URGENT: Hive not respecting escaped delimiter characters





