When converting Unicode strings to legacy character encodings, it is
possible to register a custom error handler that will catch and process
all code points that do not have a direct equivalent in the target
encoding (as described in PEP 293).

The thing to note here is that the error handler itself is required to
return the substitutions as Unicode strings - not as the target encoding
bytestrings. Some lower-level gadgetry will silently convert these
strings to the target encoding.

That is, if the substitution _itself_ doesn't contain illegal code
points for the target encoding.

Which brings us to the point: if my error handler for some reason
returns illegal substitutions (from the viewpoint of the target
encoding), how can I catch _these_ errors and make things good again?

I thought it would work automatically, by calling the error handler as
many times as necessary, and letting it work out the situation, but it
apparently doesn't. Sample code follows:

--- 8< ---

#!/usr/bin/python

import codecs

# ==================================================================
# Here's our error handler
# ==================================================================

def charset_conversion(error):

    # error.object = The original unicode string we're trying to
    #                process and which has characters for which
    #                there is no mapping in the built-in tables.
    #
    # error.start  = The index position in which the error
    #                occurred in the string
    #
    # (See PEP 293 for more information)

    # Here's our simple conversion table:

    table = {
        u"\u2022": u"\u00b7",  # "BULLET" to "MIDDLE DOT"
        u"\u00b7": u"*"        # "MIDDLE DOT" to "ASTERISK"
    }

    try:

        # If we can find the character in our conversion table,
        # let's make a substitution

        substitution = table[error.object[error.start]]

    except KeyError:

        # Okay, the character wasn't in our substitution table.
        # There's nothing we can do. Better print out its
        # unicode codepoint as a hex string instead:

        substitution = u"[U+%04x]" % ord(error.object[error.start])

    # Return the substituted string and let the built-in codec
    # continue from the next position:

    return (substitution, error.start + 1)

# ==================================================================
# Register the above-defined error handler with the name 'practical'
# ==================================================================

codecs.register_error('practical',charset_conversion)

# ==================================================================
# TEST
# ==================================================================

if __name__ == "__main__":

    print

    # Here's our test string: Three BULLET symbols, a space,
    # the word "TEST", a space again, and three BULLET symbols
    # again.

    test = u"\u2022\u2022\u2022 TEST \u2022\u2022\u2022"

    # Let's see how we can print it out with our new error
    # handler - in various encodings.

    # The following works - it just converts the internal
    # Unicode representation of the above-defined string
    # to UTF-8 without ever hitting the custom error handler:

    print " UTF-8: " + test.encode('utf-8', 'practical')

    # The next one works, too - it converts the Unicode
    # "BULLET" symbols to Latin 1 "MIDDLE DOTs":

    print "Latin 1: " + test.encode('iso-8859-1', 'practical')

    # This works as well - it converts the Unicode "BULLET"
    # symbols to IBM Codepage 437 "MIDDLE DOTs":

    print " CP 437: " + test.encode('cp437', 'practical')

    # The following doesn't work. It should convert the
    # Unicode "BULLET" symbols to "ASTERISKS" by calling
    # the error handler twice - first substituting the
    # BULLET with the MIDDLE DOT, then finding out that
    # the MIDDLE DOT doesn't work for ASCII either, and
    # falling back to a yet simpler form (by calling the
    # error handler again, which will this time substitute
    # the MIDDLE DOT with the ASTERISK) - but apparently it
    # doesn't work that way. We'll get a
    # UnicodeEncodeError instead.

    print " ASCII: " + test.encode('ascii', 'practical')

    # So the question becomes: how can I make this work
    # in a graceful manner?

--- 8< ---

--
znark


  • Serge Orlov at Mar 12, 2006 at 9:33 pm

    Jukka Aho wrote:
    When converting Unicode strings to legacy character encodings, it is
    possible to register a custom error handler that will catch and process
    all code points that do not have a direct equivalent in the target
    encoding (as described in PEP 293).

    The thing to note here is that the error handler itself is required to
    return the substitutions as Unicode strings - not as the target encoding
    bytestrings. Some lower-level gadgetry will silently convert these
    strings to the target encoding.

    That is, if the substitution _itself_ doesn't contain illegal code
    points for the target encoding.

    Which brings us to the point: if my error handler for some reason
    returns illegal substitutions (from the viewpoint of the target
    encoding), how can I catch _these_ errors and make things good again?

    I thought it would work automatically, by calling the error handler as
    many times as necessary, and letting it work out the situation, but it
    apparently doesn't. Sample code follows:


    # So the question becomes: how can I make this work
    # in a graceful manner?
    Replace the return statement with this code:

    return (substitution.encode(error.encoding, "practical").decode(
                error.encoding), error.start + 1)

    -- Serge
  • Jukka Aho at Mar 14, 2006 at 2:36 pm

    Serge Orlov wrote:

    # So the question becomes: how can I make this work
    # in a graceful manner?
    Replace the return statement with this code:

    return (substitution.encode(error.encoding, "practical").decode(
                error.encoding), error.start + 1)
    Thanks, that was quite a neat recursive solution. :) I wouldn't have
    thought of that.

    I ended up doing it without the recursion, by testing the individual
    problematic code points with .encode() within the handler, and catching
    the possible exceptions:

    --- 8< ---

    # This is our original problematic code point:
    c = error.object[error.start]

    while 1:

        # Search for a substitute code point in
        # our table:

        c = table.get(c)

        # If a substitute wasn't found, convert the original code
        # point into a hexadecimal string representation of itself
        # and exit the loop.

        if c is None:
            c = u"[U+%04x]" % ord(error.object[error.start])
            break

        # A substitute was found, but we're not sure if it is OK
        # for our target encoding. Let's check:

        try:
            c.encode(error.encoding, 'strict')
            # No exception; everything was OK, we
            # can break off from the loop now
            break

        except UnicodeEncodeError:
            # The mapping that was found in the table was not
            # OK for the target encoding. Let's loop and try
            # again; there might be a better (more generic)
            # substitution in the chain waiting for us.
            pass
    --- 8< ---

    --
    znark
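
For reference, here is a minimal, self-contained sketch (Python 2, as used in
the thread) of the handler from the original post with Serge's recursive
return in place. The handler name 'practical2' is only an illustrative choice
for this sketch, not something defined in the thread.

--- 8< ---

#!/usr/bin/python

import codecs

def charset_conversion(error):

    # The same substitution chain as in the original post:
    table = {
        u"\u2022": u"\u00b7",  # "BULLET" to "MIDDLE DOT"
        u"\u00b7": u"*"        # "MIDDLE DOT" to "ASTERISK"
    }

    try:
        substitution = table[error.object[error.start]]
    except KeyError:
        substitution = u"[U+%04x]" % ord(error.object[error.start])

    # Serge's fix: re-encode the substitution with this same error
    # handler. If the substitution is itself unencodable in the target
    # encoding, the handler is called again and can walk further down
    # the chain (BULLET -> MIDDLE DOT -> ASTERISK). Decoding the result
    # back gives the Unicode string that the codec machinery expects
    # from an error handler.
    return (substitution.encode(error.encoding, 'practical2')
                        .decode(error.encoding),
            error.start + 1)

codecs.register_error('practical2', charset_conversion)

if __name__ == "__main__":
    test = u"\u2022\u2022\u2022 TEST \u2022\u2022\u2022"
    print test.encode('ascii', 'practical2')

--- 8< ---

With the recursive return, the ASCII case degrades BULLET to MIDDLE DOT and
then to ASTERISK, so the test line above should print "*** TEST ***" instead
of raising UnicodeEncodeError.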
