Just a tip for those who are only just cutting their teeth on Python 3.0
and might have encountered the same problem as I did:

When a Python (3.x) program is run on a terminal that only supports a
legacy character encoding - such as Latin 1 or Codepage 437 - all
characters printed to stdout will be automatically converted from the
interpreter's internal Unicode representation to this legacy character
set.

This is a nice feature to have, of course, but if the original Unicode
string contains characters for which there is no equivalent in the
terminal's legacy character set, you will get the dreaded
"UnicodeEncodeError" exception.

In other words, both the "sys.stdout" and the "sys.stderr" stream have
been hardwired to do their character-encoding magic, by default, using
the 'strict' error-handling scheme:

--- 8< ---

>>> import sys
>>> sys.stdout.errors
'strict'
>>> sys.stderr.errors
'strict'

--- 8< ---

So, essentially, printing out anything but ASCII to stdout is not really
safe in Python... unless you know beforehand, for sure, what characters
the terminal will support - which at least in my mind kind of defeats
the whole purpose of those automatic, implicit conversions.
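The failure is easy to reproduce by hand, without a legacy terminal, by
encoding explicitly with the 'strict' handler (Latin 1 stands in here
for whatever encoding the terminal happens to advertise):

```python
# Encoding by hand with the 'strict' handler reproduces what print()
# does internally when the terminal's encoding is Latin 1. U+2603
# (SNOWMAN) has no Latin 1 equivalent, so the encode blows up.
try:
    "snowman: \u2603".encode("latin-1", errors="strict")
except UnicodeEncodeError as exc:
    print("caught:", exc.reason)
```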

Now, I have written a more flexible custom error handler myself and
registered it with Python's codec system, using the
codecs.register_error() function. When the handler encounters a
problematic codepoint, it will either suggest a "similar-enough" Latin 1
or ASCII substitution for it, or if there is none available in its
internal conversion table, it will simply print it out using the U+xxxx
notation. The "UnicodeEncodeError" exception will never occur with it.
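To illustrate the mechanism (the handler name and the substitution
table below are just stand-ins, not my actual code), such a handler
boils down to this: it receives the UnicodeEncodeError and must return
a (replacement string, resume position) tuple:

```python
import codecs

# A stand-in substitution table; a real one would be much larger.
SUBSTITUTIONS = {
    "\u2013": "-",    # EN DASH -> plain hyphen
    "\u201c": '"',    # LEFT DOUBLE QUOTATION MARK
    "\u201d": '"',    # RIGHT DOUBLE QUOTATION MARK
}

def friendly(exc):
    # Encode-error handlers get the exception object and return a
    # (replacement, resume_position) tuple; encoding continues from
    # that position with the replacement spliced in.
    if not isinstance(exc, UnicodeEncodeError):
        raise exc
    parts = []
    for ch in exc.object[exc.start:exc.end]:
        parts.append(SUBSTITUTIONS.get(ch, "U+%04X" % ord(ch)))
    return ("".join(parts), exc.end)

codecs.register_error("friendly", friendly)

print("caf\u00e9 \u2013 \u2603".encode("ascii", errors="friendly"))
# -> b'cafU+00E9 - U+2603'
```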

Instead of creating a custom error handler from scratch, one could also
make use of one of Python's built-in, less restrictive error handlers,
such as 'ignore', 'replace', 'xmlcharrefreplace', or 'backslashreplace'.
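Each of those built-in handlers deals with the same unencodable
characters differently; encoding to plain ASCII shows the contrast:

```python
s = "Tsch\u00fc\u00df \u2603"                  # u-umlaut and sharp s are Latin 1, the snowman is not even that
print(s.encode("ascii", "ignore"))             # b'Tsch ' - drops them silently
print(s.encode("ascii", "replace"))            # b'Tsch?? ?'
print(s.encode("ascii", "xmlcharrefreplace"))  # b'Tsch&#252;&#223; &#9731;'
print(s.encode("ascii", "backslashreplace"))   # backslash escapes: \xfc, \xdf, \u2603
```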

But in order to make things work as transparently and smoothly as
possible, I needed a way to make both the "sys.stdout" and "sys.stderr"
streams actually _use_ my custom error handler instead of the default
one.

Unfortunately, the current implementation of io.TextIOWrapper (in Python
3.0b2, at least) does not yet offer a public, documented interface for
changing the codec error handler - or, indeed, the target encoding
itself - for streams that have already been opened, and this means you
can't "officially" change it for the "stdout" or "stderr" streams,
either. (The need for this functionality is acknowledged in PEP-3116,
but has apparently not been implemented yet. [1])

So, after examining io.py and scratching my head a bit, here's how one
can currently hack one's way around this limitation:

--- 8< ---

import sys
sys.stdout._errors = 'backslashreplace'
sys.stdout._get_encoder()
sys.stderr._errors = 'backslashreplace'
sys.stderr._get_encoder()

--- 8< ---
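A sketch of one alternative that avoids the private attributes, at the
cost of replacing the stream objects wholesale: wrap each stream's
underlying binary buffer in a fresh TextIOWrapper with the desired
error handler. This relies only on documented parts of the io API,
though I haven't exercised it as much:

```python
import io
import sys

def relax_stdio(errors="backslashreplace"):
    # Rebuild stdout/stderr around their existing binary buffers,
    # keeping the encoding the old wrappers negotiated with the
    # terminal but swapping in a forgiving error handler.
    sys.stdout = io.TextIOWrapper(sys.stdout.buffer,
                                  encoding=sys.stdout.encoding,
                                  errors=errors, line_buffering=True)
    sys.stderr = io.TextIOWrapper(sys.stderr.buffer,
                                  encoding=sys.stderr.encoding,
                                  errors=errors, line_buffering=True)

# The same wrapping technique, demonstrated on an in-memory buffer:
buf = io.BytesIO()
demo = io.TextIOWrapper(buf, encoding="ascii", errors="backslashreplace")
demo.write("snow \u2603")
demo.flush()
print(buf.getvalue())  # the snowman comes out as the escape \u2603
```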

Issuing these commands makes printing Unicode strings to a legacy
terminal a safe procedure again, and you're not going to get unexpected
"UnicodeEncodeError" exceptions thrown in your face any longer. (Note:
'backslashreplace' is just an example here; you could substitute the
error handler of your choice.)

The downside of this solution is, of course, that it will break down if
the private implementation of io.TextIOWrapper in io.py changes in the
future. But as a workaround, I feel it is sufficient for now, while
waiting for the "real" support to appear in the library.

(If there's a cleaner and more future-proof way of doing the same thing
right now, I'd of course love to hear about it...)

_____

1. http://mail.python.org/pipermail/python-3000/2008-April/013366.html

--
znark


  • John Nagle at Sep 2, 2008 at 4:57 pm

    Jukka Aho wrote:
    > Just a tip for those who are only just cutting their teeth on Python 3.0
    > and might have encountered the same problem as I did:
    >
    > When a Python (3.x) program is run on a terminal that only supports a
    > legacy character encoding - such as Latin 1 or Codepage 437 - all
    > characters printed to stdout will be automatically converted from the
    > interpreter's internal Unicode representation to this legacy character set.

    Python 5 is even stricter. Only ASCII (chars 0..127) can be sent
    to standard output by default.

    John Nagle
  • Steven D'Aprano at Sep 2, 2008 at 4:54 pm

    On Tue, 02 Sep 2008 09:57:05 -0700, John Nagle wrote:
    > Jukka Aho wrote:
    >> Just a tip for those who are only just cutting their teeth on Python
    >> 3.0 and might have encountered the same problem as I did:
    >>
    >> When a Python (3.x) program is run on a terminal that only supports a
    >> legacy character encoding - such as Latin 1 or Codepage 437 - all
    >> characters printed to stdout will be automatically converted from the
    >> interpreter's internal Unicode representation to this legacy character
    >> set.
    > Python 5 is even stricter. Only ASCII (chars 0..127) can be sent
    > to standard output by default.

    Python 5??? Is this the time machine again?

    --
    Steven
  • Jukka Aho at Sep 2, 2008 at 10:53 pm

    John Nagle wrote:
    > Python 5 is even stricter. Only ASCII (chars 0..127) can be sent
    > to standard output by default.

    Python 5? (I guess I haven't been following these things enough...)

    Well, I would sure hope not.

    --
    znark

Discussion Overview
group: python-list @ python.org
categories: python
posted: Sep 2, '08 at 7:09a
active: Sep 2, '08 at 10:53p
posts: 4
users: 3