I have a tab-delimited text file that I read using TextInputFormat. I have
problems reading lines from the txt file with ascii code > 127, e.g.

P 676827 Martin Plachý amg

gets read as

P 676827 Martin Plach?

with the 3rd tab-delimited column missing. What's the best way to handle this
kind of input? Thanks
--
View this message in context: http://www.nabble.com/Text-encoding-tp24684865p24684865.html
Sent from the Hadoop core-user mailing list archive at Nabble.com.


  • Todd Lipcon at Jul 27, 2009 at 5:45 pm
    Hi,

    The issue here is that your input data is not in fact ASCII: there are no
    ASCII characters with code > 127. My guess is that your input is in latin1
    (ISO-8859-1) or some other encoding that defines one-byte character codes
    with values > 127.

    TextInputFormat (and anything else that uses the Text writable type) assumes
    the input is UTF-8. True ASCII is a subset of UTF-8, whereas latin1 text
    containing bytes above 127 (such as a lone 0xFD for ý) is generally not
    valid UTF-8.
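
To make the latin1 vs. UTF-8 point concrete, here is a minimal sketch in plain Java (no Hadoop needed) that checks whether a byte sequence is well-formed UTF-8. The class name EncodingCheck and the sample line are illustrative only, and the sample assumes the columns are separated by tabs and that the file really is latin1.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

public class EncodingCheck {
    /** Returns true if the bytes form a well-formed UTF-8 sequence. */
    public static boolean isValidUtf8(byte[] bytes) {
        // REPORT makes the decoder throw instead of silently substituting
        // a replacement character for malformed input.
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        try {
            decoder.decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // Sample line from the question; tab separators are assumed.
        String line = "P\t676827\tMartin Plach\u00FD\tamg";
        // latin1 stores ý as the single byte 0xFD, which UTF-8 never allows.
        byte[] latin1 = line.getBytes(StandardCharsets.ISO_8859_1);
        // UTF-8 stores ý as the two bytes 0xC3 0xBD.
        byte[] utf8 = line.getBytes(StandardCharsets.UTF_8);
        System.out.println(isValidUtf8(latin1)); // false
        System.out.println(isValidUtf8(utf8));   // true
    }
}
```

Running a check like this over a sample of the input file is a quick way to confirm the guess about the encoding before changing the job.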

    As for the best solution here, I'm not exactly sure. Hopefully someone else
    can pipe up with a trick to get an inputformat that works on non-UTF8 data.
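
One possible workaround, offered as a sketch rather than a tested recipe: keep TextInputFormat, but never call toString() on the Text value. Decode its raw bytes yourself with the real charset instead. This assumes the file is ISO-8859-1 and that the record reader hands the line bytes through unchanged, which is worth verifying on your Hadoop version. The helper class Latin1Lines below is hypothetical and plain Java; in a mapper you would pass it value.getBytes() and value.getLength(), since the array returned by Text.getBytes() can be longer than the valid data.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class Latin1Lines {
    // Assumption: the input is ISO-8859-1 (latin1). Change this constant
    // if the file turns out to use a different single-byte encoding.
    static final Charset INPUT_CHARSET = StandardCharsets.ISO_8859_1;

    /** Decodes the first `length` bytes of `buffer` using INPUT_CHARSET. */
    public static String decode(byte[] buffer, int length) {
        return new String(buffer, 0, length, INPUT_CHARSET);
    }

    /** Decodes one line and splits it on tabs, keeping empty trailing columns. */
    public static String[] splitColumns(byte[] buffer, int length) {
        return decode(buffer, length).split("\t", -1);
    }
}
```

Inside map(), the call would look like Latin1Lines.splitColumns(value.getBytes(), value.getLength()). Converting the file to UTF-8 before loading it into HDFS is the other obvious route and avoids touching the job code at all.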

    -Todd
    On Mon, Jul 27, 2009 at 10:22 AM, pmg wrote:


    I have a tab-delimited text file that I read using TextInputFormat. I have
    problems reading lines from the txt file with ascii code > 127, e.g.

    P 676827 Martin Plachý amg

    gets read as

    P 676827 Martin Plach?

    with the 3rd tab-delimited column missing. What's the best way to handle
    this kind of input? Thanks
