Hi,

I have been playing around for two days trying to figure out an issue
related to the default charset:


- When I run a trivial job that just displays the default charset on
Hadoop in pseudo-distributed mode, I get US-ASCII. When I display
the Java property file.encoding, I get ANSI_X3.4-1968.


- When I run the same job under Eclipse in local mode, I get UTF-8
(which is the one I expect).
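
For reference, a minimal sketch of the kind of check described above. The original job is not shown in the post, so the class and method names here are hypothetical; only the two standard JDK calls are taken as given.

```java
import java.nio.charset.Charset;

public class ShowDefaultCharset {

    /** Builds a one-line report of the JVM default charset and the file.encoding property. */
    static String report() {
        return "defaultCharset=" + Charset.defaultCharset().name()
                + " file.encoding=" + System.getProperty("file.encoding");
    }

    public static void main(String[] args) {
        // Run this inside the task JVM and inside Eclipse to compare the two environments.
        System.out.println(report());
    }
}
```

Printing both values is useful because they can be spelled differently for the same encoding: ANSI_X3.4-1968 is an alias of US-ASCII.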

I use Gentoo Linux; the locale environment variables are as
follows:

LANG=en_GB.UTF-8
LC_CTYPE="en_GB.UTF-8"
LC_NUMERIC="en_GB.UTF-8"
LC_TIME="en_GB.UTF-8"
LC_COLLATE="en_GB.UTF-8"
LC_MONETARY="en_GB.UTF-8"
LC_MESSAGES="en_GB.UTF-8"
LC_PAPER="en_GB.UTF-8"
LC_NAME="en_GB.UTF-8"
LC_ADDRESS="en_GB.UTF-8"
LC_TELEPHONE="en_GB.UTF-8"
LC_MEASUREMENT="en_GB.UTF-8"
LC_IDENTIFICATION="en_GB.UTF-8"
LC_ALL=en_GB.UTF-8

I have tried to set the file.encoding property to UTF-8 but it doesn't work.
Any help would be greatly appreciated.
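
The post does not say how the property was set. One possible pitfall, offered only as an assumption about this case: on Java 6 and later the default charset is resolved once and then cached, so calling System.setProperty at run time changes the property but not Charset.defaultCharset(). The sketch below (class and method names are hypothetical) checks this on the current JVM.

```java
import java.nio.charset.Charset;

public class LateFileEncoding {

    /**
     * Sets file.encoding at run time and reports whether the JVM default
     * charset changed as a result. The previous property value is restored.
     */
    static boolean defaultCharsetChanged(String newEncoding) {
        Charset before = Charset.defaultCharset(); // forces the default to be resolved
        String saved = System.getProperty("file.encoding");
        System.setProperty("file.encoding", newEncoding);
        try {
            return !before.equals(Charset.defaultCharset());
        } finally {
            if (saved != null) {
                System.setProperty("file.encoding", saved);
            } else {
                System.clearProperty("file.encoding");
            }
        }
    }

    public static void main(String[] args) {
        System.out.println("changed at run time: " + defaultCharsetChanged("ISO-8859-1"));
    }
}
```

If this is the cause, the value has to reach the JVM at startup, for example as -Dfile.encoding=UTF-8 on the java command line. In Hadoop releases of that period the options for task JVMs came from the mapred.child.java.opts property; whether that resolves this particular setup is untested here.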

Thank you.



--
Bruno Abitbol
bruno.abitbol@jobomix.com
http://www.jobomix.fr


  • Michael Bieniosek at Dec 18, 2009 at 6:22 pm
    My experience is that it is much better to always use methods that take the charset explicitly (InputStreamReader + FileInputStream instead of FileReader, the one-argument String.getBytes instead of the no-argument one, etc.).

    -Michael
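
A minimal sketch of the advice above, assuming UTF-8 is the encoding wanted. The class and method names are hypothetical; the JDK classes are the ones named in the reply, plus OutputStreamWriter for the write side and StandardCharsets (Java 7 and later) for the charset constant.

```java
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.io.Reader;
import java.io.Writer;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class ExplicitCharsetIO {

    private static final Charset UTF8 = StandardCharsets.UTF_8;

    /** Writes text with an explicit charset instead of FileWriter's platform default. */
    static void writeText(File file, String text) throws IOException {
        try (Writer w = new OutputStreamWriter(new FileOutputStream(file), UTF8)) {
            w.write(text);
        }
    }

    /** Reads text with an explicit charset instead of FileReader's platform default. */
    static String readText(File file) throws IOException {
        try (Reader r = new InputStreamReader(new FileInputStream(file), UTF8)) {
            StringBuilder sb = new StringBuilder();
            char[] buf = new char[4096];
            int n;
            while ((n = r.read(buf)) != -1) {
                sb.append(buf, 0, n);
            }
            return sb.toString();
        }
    }

    /** One-argument getBytes: the result does not depend on file.encoding. */
    static byte[] encode(String s) {
        return s.getBytes(UTF8);
    }

    public static void main(String[] args) throws IOException {
        File tmp = File.createTempFile("charset-demo", ".txt");
        tmp.deleteOnExit();
        String text = "caf\u00e9 \u20ac"; // non-ASCII sample, written with escapes
        writeText(tmp, text);
        System.out.println("round trip ok: " + text.equals(readText(tmp)));
        System.out.println("bytes for U+00E9: " + encode("\u00e9").length);
    }
}
```

Code written this way behaves the same whether the JVM default is US-ASCII or UTF-8, which sidesteps the difference between the two environments rather than fixing it.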

    -----Original Message-----
    From: Bruno Abitbol
    Sent: Friday, December 18, 2009 5:50 AM
    To: common-user@hadoop.apache.org
    Subject: Encoding Hell

