Hi Tom,

Currently Hive/Hadoop recognizes data as UTF-8.

If your encoding is different, you can most likely still process the data
with Hive without any problems, as long as Hive/Hadoop does not have to do
UTF-8 decoding.

What is the row format of your data? Fields separated by TAB or something?
As long as the encoding does not use the separator byte for something else
(e.g., as the second or third byte of a multi-byte character), it should be
fine.
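To illustrate the point about separator bytes: in UTF-8 every byte of a multi-byte character is in the range 0x80-0xBF (or a lead byte >= 0xC0), so an ASCII delimiter such as TAB can never appear inside a character; in some other encodings, an ASCII value can occur as a trailing byte. Below is a minimal sketch of such a check — the function name and approach are my own illustration, not anything from Hive:

```python
def delimiter_is_safe(text, encoding, delimiter=b"\t"):
    """Return True if every occurrence of the delimiter byte in the
    encoded stream corresponds to a real delimiter character in the
    text (i.e., the byte never appears inside a multi-byte character)."""
    data = text.encode(encoding)
    return data.count(delimiter) == text.count(delimiter.decode("ascii"))

# TAB is safe in UTF-8: continuation bytes are always 0x80-0xBF.
print(delimiter_is_safe("名前\t値", "utf-8"))        # True

# But e.g. in Shift_JIS, the katakana "ソ" encodes as 0x83 0x5C,
# so a backslash (0x5C) delimiter would split that character.
print(delimiter_is_safe("ソ", "shift_jis", b"\\"))   # False
```

This is why TAB-delimited data in most single- and multi-byte encodings passes through Hive untouched, while delimiters in the 0x40-0x7E range can collide with trailing bytes of some East Asian encodings.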

On Fri, Sep 25, 2009 at 2:58 PM, tom kersnick wrote:

I have some files with mixed characters from all over the world: utf-8,
latin1, latin9, and about 10 others. These are international files of raw IM
logs. Is there a way to load these files as-is into Hadoop? Is it smart
enough to interpret the files correctly as-is? My file sizes are petabytes,
and I want to write some Hive queries to find patterns. Please bear with me,
as I am a newbie.

I know I can set the character set at the server level, but I want to
make sure there is no other setting that I am missing. For example, in
MySQL I can set the character set at the DB level.....

Thanks so much!


Discussion Overview
group: user @
categories: hive, hadoop
posted: Sep 25, '09 at 9:58p
active: Sep 27, '09 at 12:25a

2 users in discussion: Zheng Shao (1 post), Tom Kersnick (1 post)
