Grokbase Groups Hive user May 2011

  • Ankit bhatnagar at May 8, 2011 at 12:05 am
    Hi

    I am facing a weird issue with file parsing. My log files use a thorn
    ('þ') as the field separator.
    I wrote a test case for my deserializer and I am confused by the result:
    it works fine when I pass a line directly to the deserializer, but when I
    run it in Hive the line is not split into columns and the table in Hive
    still contains the thorn as it is.

    Any help would be appreciated.

    Thanks
    Ankit
  • Jasper Knulst at May 8, 2011 at 8:09 pm
    Hi Ankit,

    I know your problem because I had to deal with a thorn ('þ') separated file
    too. So far, Hive cannot handle multibyte separators, so I turned to the
    custom SerDe option myself. If you manage to capture the 'þ' in the regex,
    you could try the same.

    I tried:

    ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe'
    WITH SERDEPROPERTIES ("input.regex" =
    "(.*)þ(.*)þ(.*)þ(.*)þ(.*)þ(.*)þ(.*)þ(.*)þ(.*)þ(.*)þ(.*)þ(.*)þ(.*)þ(.*)þ(.*)þ(.*)þ(.*)þ(.*)þ(.*)þ(.*)þ(.*)þ(.*)þ(.*)þ(.*)þ(.*)þ(.*)þ(.*)")

    In my case the 'þ' in the regex did match the 'þ' in the data, but this
    regex was too greedy. In the end I had to write a regex for every field
    between the separators, and that was so complicated that I wrote a MR job
    to replace the 'þ' with '~', which Hive accepts as a field separator
    (ROW FORMAT DELIMITED FIELDS TERMINATED BY '~').
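
    A possible way to tame the greediness, sketched here in plain Java rather
    than in the SerDe itself: replace each "(.*)" with a negated character
    class so that no group can run past a separator. The class name
    ThornRegex, the helper names and the three-column sample are made up for
    illustration, and the sketch assumes the whole line has to match the
    pattern.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Sketch: split a thorn-separated line with groups that cannot cross a separator. */
final class ThornRegex {
    // U+00FE is the thorn; writing it as an escape keeps the source file pure ASCII.
    static final String THORN = "\u00FE";

    /** Builds "([^T]*)T([^T]*)...", where T stands for the thorn: one group per column. */
    static Pattern buildPattern(int columns) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < columns; i++) {
            if (i > 0) {
                sb.append(THORN);
            }
            sb.append("([^").append(THORN).append("]*)");
        }
        return Pattern.compile(sb.toString());
    }

    /** Returns the columns, or an empty list if the line does not have exactly that many. */
    static List<String> split(String line, int columns) {
        Matcher m = buildPattern(columns).matcher(line);
        List<String> out = new ArrayList<>();
        if (m.matches()) {
            for (int i = 1; i <= columns; i++) {
                out.add(m.group(i));
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // prints [a, b, c]
        System.out.println(split("a" + THORN + "b" + THORN + "c", 3));
    }
}
```

    The string produced by buildPattern could in principle be used as the
    "input.regex" value, but whether the 'þ' survives the trip into Hive still
    depends on the encoding of the files and of the session.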

    So I turned to another solution, and I am happy I did. Keep us posted if
    you find another way.

    Jasper

    --
    Kind Regards \ Met Vriendelijke Groet,

    Jasper Knulst
    BI Consultant

    VLC Den Haag
    Gildeweg 5B
    2632 BD Nootdorp

    M: +31 (0)6 19 66 75 11
    T: +31 (0)15 764 07 50
    Skype: jasper_knulst_vlc
  • Ankit bhatnagar at May 9, 2011 at 7:57 pm
    Hi Jasper,

    How did you find the 'þ'?

    My browser shows this instead: �

    Ankit
  • Jasper Knulst at May 9, 2011 at 8:17 pm
    Hi Ankit,

    It all depends on your environment, locale and encoding. The character
    above worked in my case, but I believe I have seen your character as well.
    In the end, though, it is not your browser that has to do the work of
    interpreting the multibyte character. That is the main problem with the
    thorn: every platform and piece of software sees it differently.
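
    To make "sees it differently" concrete, here is a small Java sketch. It
    assumes (this is not confirmed anywhere in the thread) that the log files
    are in a single-byte encoding such as ISO-8859-1, where the thorn is the
    byte 0xFE, while the tool reading them expects UTF-8, where the thorn is
    the two bytes 0xC3 0xBE. The class name ThornEncoding is made up.

```java
import java.nio.charset.StandardCharsets;

/** Sketch: the same thorn, seen through two different encodings. */
final class ThornEncoding {
    // U+00FE is the thorn; the escape keeps the source file pure ASCII.
    static final String THORN = "\u00FE";

    /** Hex dump of a byte array, with bytes separated by single spaces. */
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    /** Decodes the single-byte (ISO-8859-1) thorn as if it were UTF-8. */
    static String misreadAsUtf8() {
        byte[] latin1 = THORN.getBytes(StandardCharsets.ISO_8859_1);
        return new String(latin1, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        System.out.println(hex(THORN.getBytes(StandardCharsets.ISO_8859_1))); // prints FE
        System.out.println(hex(THORN.getBytes(StandardCharsets.UTF_8)));      // prints C3 BE

        // A lone 0xFE byte is not valid UTF-8, so the decoder substitutes
        // the replacement character U+FFFD.
        System.out.println(misreadAsUtf8().equals("\uFFFD"));                 // prints true
    }
}
```

    If that assumption holds, it would also explain why a unit test that
    passes a Java String to the deserializer can succeed while the same
    deserializer fails on the raw file bytes.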

    Jasper

  • Ankit bhatnagar at May 9, 2011 at 8:47 pm
    Hi Jasper,

    Could you please share your MR program?

    I am not able to capture this character.

    Ankit
  • Jasper Knulst at May 9, 2011 at 9:09 pm
    Hi Ankit,


    I have this in my Java mapper code:

    String oldSeperator = "�"; // the thorn as Java sees it
    String newSeperator = "~";

    In Eclipse it shows as �, which is the standard Java way of saying "I don't
    know this multibyte character".

    When you copy and paste this � into the Linux shell, it displays as þ, at
    least on RHEL5 with UTF-8 encoding.

    When you paste it into a Windows (ANSI) based environment it shows as �
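
    A variant that does not depend on copying and pasting the character: refer
    to the thorn by its code point (U+00FE) and decode the raw bytes with an
    explicit charset. This is only a sketch in plain Java, not the actual MR
    job. The class name ThornToTilde and the method replaceSeparator are made
    up, the Hadoop mapper wiring is left out, and ISO-8859-1 as the input
    encoding is an assumption.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

/** Sketch: swap the thorn separator for a tilde, independent of editor encoding. */
final class ThornToTilde {
    static final char THORN = '\u00FE'; // the thorn, named by code point
    static final char TILDE = '~';

    /** Decodes the raw line with the given charset, then replaces every thorn. */
    static String replaceSeparator(byte[] rawLine, Charset inputCharset) {
        String line = new String(rawLine, inputCharset);
        return line.replace(THORN, TILDE);
    }

    public static void main(String[] args) {
        // "a", thorn, "b", thorn, "c" as single-byte (ISO-8859-1) data.
        byte[] raw = {'a', (byte) 0xFE, 'b', (byte) 0xFE, 'c'};
        // prints a~b~c
        System.out.println(replaceSeparator(raw, StandardCharsets.ISO_8859_1));
    }
}
```

    This only works cleanly if '~' never occurs inside the field values
    themselves; otherwise another replacement character has to be picked.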


  • Jasper Knulst at Sep 7, 2011 at 8:53 am
    Have you tried this?

