I'm really new to dealing with unicode, so please bear with me. I'm
trying to add unicode support to a program I'm working on, and I'm
getting stuck a little when printing a unicode string to a file. I
know I have to encode the string using an encoding (UTF-8, UTF-16,
latin-1, etc). The problem is that I don't know how to determine what
the *right* encoding to use on a particular string is. The way I
understand it, utf-8 will handle any unicode data, but it will
translate characters not in the standard ASCII set to fit within the
8-bit character table. My problem is I'm handling data from a lot of
different encodings (latin, eastern, asian, etc) and I can't allow
data in the strings to be changed. I also can't (at least I don't
know how to) determine what encodings the strings are using. I.e., I
don't know what strings are from what languages. Is there any way to
determine, from the unicode string itself, what encoding I need to use
to prevent data loss? Or do I need to find a way to determine
beforehand what encoding they are using when they are read in?

Am I even asking the right questions? I'm really pretty lost and my
O'Reilly books aren't helping very much.

-Sean

  • Martin v. Löwis at Apr 29, 2003 at 9:02 pm

    Sean wrote:

    > The problem is that I don't know how to determine what
    > the *right* encoding to use on a particular string is.

    Do you have this problem when reading a byte string, or when
    writing it?

    If you are given a byte string, and you are supposed to interpret
    the bytes as characters, there is, in general, no good way to do
    so - that's why people came up with the idea of a universal character
    set in the first place, to overcome the problems with multiple
    character sets.

    That said, you can make educated guesses on the data you read.

    1. Perhaps the data you read has some file format which specifies
    the encoding, or allows parametrization, such as XML or HTML.
    You will need to look *into* the file to find out what its
    encoding is.

    2. Perhaps the data has some fixed encoding, as part of the file
    format specification. For many files, this is US-ASCII.

    3. Perhaps this is a plain text file, and you should use the encoding
    that the user's text editor is most likely to use (of course, you
    don't know what text editor the user uses, nor what encoding that
    editor uses). locale.getdefaultlocale()[1] offers you some guess;
    python 2.3's locale.getpreferredencoding() gives a better guess.
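
    For example, a rough sketch of that guess (Python 2-era; "input.txt" is
    just a placeholder name):

    import locale

    # best guess at the encoding this user's text editor would produce
    try:
        guess = locale.getpreferredencoding()     # Python 2.3 and later
    except AttributeError:
        guess = locale.getdefaultlocale()[1]      # older Pythons
    guess = guess or "ascii"                      # either call may yield None

    data = open("input.txt", "rb").read()         # bytes, encoding unknown
    text = unicode(data, guess, "replace")        # decode using the guess
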
    > Is there any way to
    > determine, from the unicode string itself, what encoding I need to use
    > to prevent data loss?

    That sounds like you have the problem when *writing* Unicode strings.

    In that case, you can invoke .encode: it will raise a UnicodeError if
    some character in the string cannot be represented in the chosen encoding.
    At some point, you need to make up your
    mind what encoding to use for a certain file - if you then get an error,
    all you can do is to inform the user, and

    a) perhaps ignore the bad characters, replacing them with appropriate
    replacement characters (usually '?'), or

    b) go back and recode the output so far in a different encoding.
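
    In code, that choice might look roughly like this (a small sketch; the
    sample text and target encoding are arbitrary):

    u = u"25\N{DEGREE SIGN}C caf\N{LATIN SMALL LETTER E WITH ACUTE}"
    try:
        data = u.encode("ascii")
    except UnicodeError:
        # a) substitute replacement characters for what cannot be represented
        data = u.encode("ascii", "replace")   # degree sign and e-acute become '?'
        # b) or recode in an encoding that can represent everything
        data = u.encode("utf-8")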

    > Am I even asking the right questions? I'm really pretty lost and my
    > O'Reilly books aren't helping very much.

    Don't worry. These things are inherently difficult. Organizations like
    W3C have essentially given up, and say that XML is UTF-8 by default
    (knowing that this will support arbitrary characters). If people
    absolutely want XML in different encodings, they can do that, but they
    are left alone with the issue of encoding unsupported characters
    (for XML, they can actually use character references).

    You will have to make explicit choices: either support only UTF-8
    (and accept that it will be tedious for some users to produce the proper
    files), or support arbitrary encodings (and accept that some encodings
    cannot represent all characters, and that you may not have the codecs
    available to read the data, and that a mechanism must be provided to
    determine the encoding), or support only a few non-UTF-8 encodings
    (restricting the data format to a subset of all living languages).

    Regards,
    Martin
  • Skip Montanaro at Apr 29, 2003 at 9:30 pm
    Sean> Is there any way to determine, from the unicode string itself,
    Sean> what encoding I need to use to prevent data loss?

    If by "unicode string" you mean one of Python's unicode objects, there's no
    need. It's stored internally in a way which doesn't require you to tack a
    specific encoding onto it for internal use. When you write it out, you have
    to figure out what encoding is appropriate for that particular output device
    (a file which will be consumed by another program, a terminal which supports
    a limited number of encodings, etc). In general, if you are the only
    consumer of the file you're producing, I'd simply encode the object as
    utf-8.
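
    For example (a sketch; "out.txt" is a placeholder name):

    import codecs

    u = u"some text, including a degree sign: \N{DEGREE SIGN}"

    # codecs.open gives a file object that encodes unicode objects on write
    f = codecs.open("out.txt", "w", encoding="utf-8")
    f.write(u)
    f.close()

    # equivalently, encode explicitly and write the resulting byte string
    open("out.txt", "wb").write(u.encode("utf-8"))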

    If by "unicode string" you mean a series of bytes input to your program from
    some unknown source in some unknown encoding, you have your work cut out for
    you. Depending on what encodings are used in the strings input to your
    program and how you get them, you may or may not reliably know the encoding.
    For example, if you're sucking pages off a web server, it will probably tell
    you what the encoding is (presuming whatever tool was used to generate the
    page encoded things properly). If you're just being fed random files with
    no encoding information, you have to apply some heuristics to the problem.
    The various encodings related to iso-8859-* (including the various Microsoft
    125* code pages) overlap so much that it can be a challenge to get things
    correct. On the other hand, as I understand it, the various common Japanese
    encodings tend to not overlap much, if at all, so you can pretty much just
    try decoding using the various possibilities in any order and quit with the
    first one which succeeds.

    I use this function in my code:

    import re
    import sys

    def decode_heuristically(s, enc=None, denc=sys.getdefaultencoding()):
        """try interpreting s using several possible encodings.
        return value is a three-element tuple.  The first element is either an
        ASCII string or a Unicode object.  The second element is 1
        if the decoder had to punt and delete some characters from the input
        to successfully generate a Unicode object.  The third element is the
        name of the encoding that was used."""
        if isinstance(s, unicode):
            return s, 0, "utf-8"
        try:
            x = unicode(s, "ascii")
            # if it's ascii, we're done
            return s, 0, "ascii"
        except UnicodeError:
            encodings = ["utf-8", "iso-8859-1", "cp1252", "iso-8859-15"]
            # if the default encoding is not ascii it's a good thing to try
            if denc != "ascii":
                encodings.insert(0, denc)
            # always try any caller-provided encoding first
            if enc:
                encodings.insert(0, enc)
            for enc in encodings:

                # Most of the characters between 0x80 and 0x9F are displayable
                # in cp1252 but are control characters in iso-8859-1.  Skip
                # iso-8859-1 if they are found, even though the unicode() call
                # might well succeed.

                if (enc in ("iso-8859-15", "iso-8859-1") and
                        re.search(r"[\x80-\x9f]", s) is not None):
                    continue

                # Characters in the given range are more likely to be
                # symbols used in iso-8859-15, so even though unicode()
                # may accept such strings with those encodings, skip them.

                if (enc in ("iso-8859-1", "cp1252") and
                        re.search(r"[\xa4\xa6\xa8\xb4\xb8\xbc-\xbe]", s) is not None):
                    continue

                try:
                    x = unicode(s, enc)
                except UnicodeError:
                    pass
                else:
                    if x.encode(enc) == s:
                        return x, 0, enc

            # nothing worked perfectly - try again, but use the "ignore"
            # parameter and return the longest result
            output = [(unicode(s, enc, "ignore"), enc) for enc in encodings]
            output = [(len(x[0]), x) for x in output]
            output.sort()
            x, enc = output[-1][1]
            return x, 1, enc
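
    For example (the byte string below is made up to show the cp1252 case):

    raw = "caf\xe9 \x93quoted\x94"      # e-acute plus cp1252 curly quotes
    text, lossy, enc = decode_heuristically(raw)
    # the \x93/\x94 bytes rule out the iso-8859 variants, so enc is "cp1252"
    print enc, lossy, repr(text)        # cp1252 0 u'caf\xe9 \u201cquoted\u201d'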

    Note that I deal almost exclusively with ASCII, but that Microsoft encodings
    and the occasional Latin-1 stuff which creep in give me problems.
    Everything which goes into my database gets utf-8-encoded. Most of my
    inputs come from web form submissions, which frequently seem to have no
    encoding information or specify the encoding incorrectly.

    The above function resulted from a huge amount of agony on my part, coupled
    with a fair amount of feedback from Martin von Löwis. It could probably
    still stand some refinement (I don't recall if I ever incorporated Martin's
    last comments on the topic).

    Sean> Or do I need to find a way to determine beforehand what encoding
    Sean> they are using when they are read in?

    If you can find that out reliably, it's much better than guessing like I do
    above. The closer to the source of the data you can look for encoding
    information, the better chance you'll find something, but you still have to
    be prepared to punt. I still have cp1252 stuff creeping into my database on
    occasion, so each night a cron job dumps the database to flat files and
    checks for stuff that snuck through. (I've obviously missed heuristically
    decoding some inputs, but the system I'm maintaining is about eight years
    old and only gets a small amount of attention these days.)

    Skip
  • Steven Taschuk at Apr 29, 2003 at 9:33 pm

    Quoth Sean:
    > I'm really new to dealing with unicode, so please bear with me. I'm
    > trying to add unicode support to a program I'm working on, and I'm
    > getting stuck a little when printing a unicode string to a file. I
    > know I have to encode the string using an encoding (UTF-8, UTF-16,
    > latin-1, etc). The problem is that I don't know how to determine what
    > the *right* encoding to use on a particular string is. The way I
    > understand it, utf-8 will handle any unicode data, but it will
    > translate characters not in the standard ASCII set to fit within the
    > 8-bit character table. [...]

    Actually characters outside ASCII get turned into multibyte
    sequences by UTF-8. To use an example that came up here a little
    while back:
    >>> u = u'\N{DEGREE SIGN}'    # U+00B0; not in ASCII
    >>> u
    u'\xb0'
    >>> u.encode('utf-8')         # two-byte sequence
    '\xc2\xb0'

    UTF-8 is capable of representing any Unicode character without
    information loss, so if you need to deal with arbitrary Unicode
    it's a good choice. (It also has other pleasant properties.)

    > [...] My problem is I'm handling data from a lot of
    > different encodings (latin, eastern, asian, etc) and I can't allow
    > data in the strings to be changed. I also can't (at least I don't
    > know how to) determine what encodings the strings are using. I.e., I
    > don't know what strings are from what languages. [...]

    I think I detect a confusion or two here.

    To encode is to turn (a sequence of) characters into (a sequence
    of) bytes, and to decode is the reverse. An encoding is a scheme
    for doing these things; it need not be strongly associated with
    any particular language. (Though often an encoding can only
    represent certain characters, and as a result is only useful for
    those languages which use just those characters.)

    A Unicode string (that is, a Python object of type 'unicode', such
    as u'foo') is a sequence of characters, not bytes. It therefore
    is not in any particular encoding.

    A normal string (that is, a Python object of type 'str', such as
    'foo') is a sequence of bytes, not characters. It can be
    interpreted as a sequence of characters only if an encoding is
    used to decode it. (Usually ASCII is assumed if you do not
    specify one explicitly.)
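
    For instance (a small illustration; the byte values are arbitrary):

    b = '\xc2\xb0'                       # a str: two bytes
    u = unicode(b, "utf-8")              # decode bytes -> characters: u'\xb0'
    assert u.encode("utf-8") == b        # encode characters -> the same bytes

    # the same single character, as bytes, in a different encoding
    print repr(u.encode("iso-8859-1"))   # '\xb0': one byte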

    So, if you have a bunch of normal strings and don't know what
    encodings they're in, you're hooped. But if you have a bunch of
    Unicode strings, it doesn't make sense to ask what encodings
    they're in.

    Now, as for not allowing the data in your strings to be changed:
    If you mean you need to preserve the same sequence of characters,
    then it's okay to change the encoding. You'll almost certainly
    want the file you produce to be all in one encoding, so you'll
    want an encoding which can represent any character you might
    encounter -- UTF-8, for example.

    But if you mean you need to preserve the exact byte sequences,
    you're hooped. (Unless you'd be happy to have the output file be
    a mishmash of different encodings and so virtually unusable.)

    > [...] Is there any way to
    > determine, from the unicode string itself, what encoding I need to use
    > to prevent data loss? Or do I need to find a way to determine
    > beforehand what encoding they are using when they are read in?

    You will have to know the encoding at the input stage; use that
    information to decode the bytes into a Unicode string. Then
    assemble the Unicode strings to be output and encode them in, for
    example, UTF-8.

    In principle you might be able to look at the characters in a
    Unicode string and determine some "least encoding" which could
    deal with all the characters in it. But there's not much point to
    this, imho; just use UTF-8, which can handle anything.
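
    If you did want to do that, a sketch might look like this (the candidate
    list is arbitrary, ordered from most to least restrictive):

    def least_encoding(u, candidates=("ascii", "iso-8859-1", "utf-8")):
        # return the first candidate encoding able to represent every
        # character of the unicode string u
        for enc in candidates:
            try:
                u.encode(enc)
            except UnicodeError:
                continue
            return enc
        raise ValueError("no candidate encoding can represent %r" % u)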

    --
    Steven Taschuk staschuk at telusplanet.net
    Receive them ignorant; dispatch them confused. (Weschler's Teaching Motto)
  • Sean at Apr 30, 2003 at 3:07 pm
    Thanks for the amazingly helpful responses everyone. Perhaps a more
    detailed explanation of what I'm doing might be in order. My program
    is reading data from a commercial customer database through their API
    (that I don't have access to). Most of the data I get back are python
    string objects, but I'll occasionally come across a Unicode object. I
    need to write this data out to text (.csv format) so that it can later
    be read back in and reinserted into the database. In order to write
    to disk, I need to encode the Unicode object using some kind of
    encoding. What I am worried about is when I write to the file,
    someone's accented 'e' or random non-Latin character will get changed
    into something else to fit the standard character set. Making the
    problem worse is that the company that makes the database software I'm
    accessing has spotty language support at best, and even worse
    technical support.
  • Skip Montanaro at Apr 30, 2003 at 3:27 pm
    sean> From what I'm hearing here, it sounds like I can just use UTF-8
    sean> and not have to worry about it.

    Sounds like it. As long as you know the data encoded in the csv file is in
    utf-8, you're golden when you read it back in, precisely because you know
    the encoding.
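
    A round-trip sketch (Python 2's csv module works on byte strings, so each
    field gets encoded to utf-8 on the way out and decoded on the way back in;
    the file name and fields are just placeholders):

    import csv

    rows = [[u"caf\N{LATIN SMALL LETTER E WITH ACUTE}", u"25\N{DEGREE SIGN}C"]]

    # write: encode each unicode field to utf-8 bytes
    writer = csv.writer(open("dump.csv", "wb"))
    for row in rows:
        writer.writerow([field.encode("utf-8") for field in row])

    # read back: decode each field from utf-8 to unicode
    for row in csv.reader(open("dump.csv", "rb")):
        fields = [unicode(field, "utf-8") for field in row]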

    Skip
