FAQ
hi,
cat test.py
#!/usr/bin/env python
#-*- coding: utf-8 -*-
u = u'moçambique'
print u.encode("utf-8")
print u

chmod +x test.py
./test.py
moçambique
moçambique

./test.py > output.txt
Traceback (most recent call last):
File "./test.py", line 5, in <module>
print u
UnicodeEncodeError: 'ascii' codec can't encode character
u'\xe7' in position 2: ordinal not in range(128)

in Python 2.7,
how do I tell Python to send the same thing to stdout and
to the file output.txt?

It doesn't seem logical that the behaviour changes when
output is sent to a file.

Thanks,
Sérgio M. B.


  • Ben Finney at Jun 9, 2011 at 2:39 am

    Sérgio Monteiro Basto <sergiomb at sapo.pt> writes:

    ./test.py
    moçambique
    moçambique
    In this case your terminal is reporting its encoding to Python, and it's
    capable of taking the UTF-8 data that you send to it in both cases.
    ./test.py > output.txt
    Traceback (most recent call last):
    File "./test.py", line 5, in <module>
    print u
    UnicodeEncodeError: 'ascii' codec can't encode character
    u'\xe7' in position 2: ordinal not in range(128)
    In this case your shell has no preference for the encoding (since you're
    redirecting output to a file).

    In the first print statement you specify the encoding UTF-8, which is
    capable of encoding the characters.

    In the second print statement you haven't specified any encoding, so the
    default ASCII encoding is used.


    Moral of the tale: Make sure an encoding is specified whenever data
    steps between bytes and characters.
    It doesn't seem logical that the behaviour changes when output is sent to a file.
    They're different files, which have been opened with different
    encodings. If you want a different encoding, you need to specify that.

    --
    \ "There's no excuse to be bored. Sad, yes. Angry, yes. |
    `\ Depressed, yes. Crazy, yes. But there's no excuse for boredom, |
    _o__) ever." --Viggo Mortensen |
    Ben Finney
  • Sérgio Monteiro Basto at Jun 9, 2011 at 9:16 pm

    Ben Finney wrote:

    Sérgio Monteiro Basto <sergiomb at sapo.pt> writes:
    ./test.py
    moçambique
    moçambique
    In this case your terminal is reporting its encoding to Python, and it's
    capable of taking the UTF-8 data that you send to it in both cases.
    ./test.py > output.txt
    Traceback (most recent call last):
    File "./test.py", line 5, in <module>
    print u
    UnicodeEncodeError: 'ascii' codec can't encode character
    u'\xe7' in position 2: ordinal not in range(128)
    In this case your shell has no preference for the encoding (since you're
    redirecting output to a file).
    How do I tell Python that I want it to write UTF-8 to files?

    In the first print statement you specify the encoding UTF-8, which is
    capable of encoding the characters.

    In the second print statement you haven't specified any encoding, so the
    default ASCII encoding is used.


    Moral of the tale: Make sure an encoding is specified whenever data
    steps between bytes and characters.
    It doesn't seem logical that the behaviour changes when output is sent to a file.
    They're different files, which have been opened with different
    encodings. If you want a different encoding, you need to specify that.
  • Ben Finney at Jun 9, 2011 at 11:19 pm

    Sérgio Monteiro Basto <sergiomb at sapo.pt> writes:

    Ben Finney wrote:
    In this case your shell has no preference for the encoding (since
    you're redirecting output to a file).
    How do I tell Python that I want it to write UTF-8 to files?
    You already did:
    In the first print statement you specify the encoding UTF-8, which
    is capable of encoding the characters.
    If you want UTF-8 on the byte stream for a file, specify it when opening
    the file, or when reading or writing the file.
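
    A minimal sketch of doing that in Python 2 with the standard codecs
    module (the file name and the choice of UTF-8 here are only
    illustrative assumptions):

    # -*- coding: utf-8 -*-
    import codecs

    u = u'moçambique'

    # Open the file with an explicit encoding; the returned stream accepts
    # unicode strings and encodes them to UTF-8 as it writes.
    with codecs.open('output.txt', 'w', encoding='utf-8') as f:
        f.write(u + u'\n')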

    --
    \ "But Marge, what if we chose the wrong religion? Each week we |
    `\ just make God madder and madder." --Homer, _The Simpsons_ |
    _o__) |
    Ben Finney
  • Benjamin Kaplan at Jun 9, 2011 at 3:00 am

    2011/6/8 Sérgio Monteiro Basto <sergiomb at sapo.pt>:
    hi,
    cat test.py
    #!/usr/bin/env python
    #-*- coding: utf-8 -*-
    u = u'moçambique'
    print u.encode("utf-8")
    print u

    chmod +x test.py
    ./test.py
    moçambique
    moçambique

    ./test.py > output.txt
    Traceback (most recent call last):
    File "./test.py", line 5, in <module>
    print u
    UnicodeEncodeError: 'ascii' codec can't encode character
    u'\xe7' in position 2: ordinal not in range(128)

    in Python 2.7,
    how do I tell Python to send the same thing to stdout and
    to the file output.txt?

    It doesn't seem logical that the behaviour changes when
    output is sent to a file.

    Thanks,
    Sérgio M. B.
    That's not a terminal vs file thing. It's a "file that declares its
    encoding" vs a "file that doesn't declare its encoding" thing. Your
    terminal declares that it is UTF-8. So when you print a Unicode string
    to your terminal, Python knows that it's supposed to turn it into
    UTF-8. When you pipe the output to a file, that file doesn't declare
    an encoding. So rather than guess which encoding you want, Python
    defaults to the lowest common denominator: ASCII. If you want
    something to be a particular encoding, you have to encode it yourself.

    You have a couple of choices on how to make it work:
    1) Play dumb and always encode as UTF-8. This would look really weird
    if someone tried running your program in a terminal with a CP437
    encoding (like cmd.exe on at least the US version of Windows), but it
    would never crash.
    2) Check sys.stdout.encoding. If it's ASCII, then encode your unicode
    string with the unicode-escape codec, which substitutes escape
    sequences for all non-ASCII characters.
    3) Check to see if sys.stdout.isatty() and have different behavior for
    terminals vs files. If you're on a terminal that doesn't declare its
    encoding, encoding it as UTF-8 probably won't help. If you're writing
    to a file, that might be what you want to do.
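
    A rough sketch along the lines of option 3 in Python 2; wrapping
    sys.stdout in a codecs writer and falling back to UTF-8 are
    assumptions, not something the post above prescribes:

    #!/usr/bin/env python
    # -*- coding: utf-8 -*-
    import codecs
    import sys

    # When stdout is not a terminal (e.g. redirected to a file) it usually
    # reports no encoding, so wrap it to encode unicode output as UTF-8.
    if not sys.stdout.isatty() or sys.stdout.encoding is None:
        sys.stdout = codecs.getwriter('utf-8')(sys.stdout)

    u = u'moçambique'
    print u

    On a UTF-8 terminal this makes ./test.py and ./test.py > output.txt
    produce the same bytes.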
  • Sérgio Monteiro Basto at Jun 9, 2011 at 9:14 pm

    Benjamin Kaplan wrote:

    2011/6/8 Sérgio Monteiro Basto <sergiomb at sapo.pt>:
    hi,
    cat test.py
    #!/usr/bin/env python
    #-*- coding: utf-8 -*-
    u = u'moçambique'
    print u.encode("utf-8")
    print u

    chmod +x test.py
    ./test.py
    moçambique
    moçambique

    ./test.py > output.txt
    Traceback (most recent call last):
    File "./test.py", line 5, in <module>
    print u
    UnicodeEncodeError: 'ascii' codec can't encode character
    u'\xe7' in position 2: ordinal not in range(128)

    in Python 2.7,
    how do I tell Python to send the same thing to stdout and
    to the file output.txt?

    It doesn't seem logical that the behaviour changes when
    output is sent to a file.

    Thanks,
    Sérgio M. B.
    That's not a terminal vs file thing. It's a "file that declares its
    encoding" vs a "file that doesn't declare its encoding" thing. Your
    terminal declares that it is UTF-8. So when you print a Unicode string
    to your terminal, Python knows that it's supposed to turn it into
    UTF-8. When you pipe the output to a file, that file doesn't declare
    an encoding. So rather than guess which encoding you want, Python
    defaults to the lowest common denominator: ASCII. If you want
    something to be a particular encoding, you have to encode it yourself.
    Exactly the opposite: if Python doesn't know the encoding, it should not
    try to encode to ASCII.
    You have a couple of choices on how to make it work:
    1) Play dumb and always encode as UTF-8. This would look really weird
    if someone tried running your program in a terminal with a CP437
    encoding (like cmd.exe on at least the US version of Windows), but it
    would never crash.
    I want Python not to care about the terminal encoding and to send
    characters as they are, whether to a terminal or to a file.
    2) Check sys.stdout.encoding. If it's ASCII, then encode your unicode
    string with the unicode-escape codec, which substitutes escape
    sequences for all non-ASCII characters.
    How do I change sys.stdout.encoding to always be UTF-8, or at least have a
    consistent sys.stdout.encoding?
    3) Check to see if sys.stdout.isatty() and have different behavior for
    terminals vs files. If you're on a terminal that doesn't declare its
    encoding, encoding it as UTF-8 probably won't help. If you're writing
    to a file, that might be what you want to do.

    Thanks,
  • Nobody at Jun 9, 2011 at 9:46 pm

    On Thu, 09 Jun 2011 22:14:17 +0100, Sérgio Monteiro Basto wrote:

    Exactly the opposite: if Python doesn't know the encoding, it should not
    try to encode to ASCII.
    What should it decode to, then?

    You can't write characters to a stream, only bytes.
    I want Python not to care about the terminal encoding and to send
    characters as they are, whether to a terminal or to a file.
    You can't write characters to a stream, only bytes.
  • Terry Reedy at Jun 10, 2011 at 12:14 am

    On 6/9/2011 5:46 PM, Nobody wrote:
    On Thu, 09 Jun 2011 22:14:17 +0100, Sérgio Monteiro Basto wrote:

    Exactly the opposite: if Python doesn't know the encoding, it should not
    try to encode to ASCII.
    What should it decode to, then?

    You can't write characters to a stream, only bytes.
    I want Python not to care about the terminal encoding and to send
    characters as they are, whether to a terminal or to a file.
    You can't write characters to a stream, only bytes.
    Character representations are for people, byte representations are for
    computers.

    --
    Terry Jan Reedy
  • Sérgio Monteiro Basto at Jun 10, 2011 at 1:11 am

    Nobody wrote:

    Exactly the opposite: if Python doesn't know the encoding, it should not
    try to encode to ASCII.
    What should it decode to, then?
    UTF-8, as on the tty. How do I change this default?
    You can't write characters to a stream, only bytes.
    OK, got the point.
    Thanks,
  • Ben Finney at Jun 10, 2011 at 1:45 am

    Sérgio Monteiro Basto <sergiomb at sapo.pt> writes:

    Nobody wrote:
    Exactly the opposite: if Python doesn't know the encoding, it should not
    try to encode to ASCII.
    Are you advocating that Python should refuse to write characters unless
    the encoding is specified? I could sympathise with that, but currently
    that's not what Python does; instead it defaults to the ASCII codec.
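
    A quick way to see that default from Python 2 (just a sketch):

    import sys
    print sys.getdefaultencoding()   # typically 'ascii' on Python 2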
    What should it decode to, then?
    UTF-8, as on the tty
    But when you explicitly redirect to a file, it's *not* going to a TTY.
    It's going to a file whose encoding isn't known unless you specify it.

    --
    \ "Reality must take precedence over public relations, for nature |
    `\ cannot be fooled." --Richard P. Feynman |
    _o__) |
    Ben Finney
  • Sérgio Monteiro Basto at Jun 10, 2011 at 1:59 am

    Ben Finney wrote:

    Exactly the opposite: if Python doesn't know the encoding, it should not
    try to encode to ASCII.
    Are you advocating that Python should refuse to write characters unless
    the encoding is specified? I could sympathise with that, but currently
    that's not what Python does; instead it defaults to the ASCII codec.
    That could be a solution ;) or a smarter default based on LANG, for example
    (as many GNU tools do).

    --
    Sérgio M. B.
  • Sérgio Monteiro Basto at Jun 10, 2011 at 3:11 pm

    Ben Finney wrote:

    What should it decode to, then?
    UTF-8, as on the tty
    But when you explicitly redirect to a file, it's not going to a TTY.
    It's going to a file whose encoding isn't known unless you specify it.
    OK, after thinking about this: this problem exists because Python wants to
    be smart with ttys, which in my point of view is wrong; it should not
    encode to UTF-8 just because the tty is in UTF-8. Python should always
    encode to the same thing. If the default is ASCII, it should always encode
    to ASCII.
    Yes, it should send ASCII to the tty too: if I send my code to a guy on
    Windows whose tty uses cp1000whatever, it shouldn't give encoding errors
    and should send ASCII.
    If we want, we change the default to whatever we want, but without such a
    "default change" Python should not change its behaviour depending on the
    output. I prefer strange output on a different platform to encoding errors.
    And I have /usr/bin/iconv.

    Thanks for your attention, and sorry about my very limited English.
    --
    Sérgio M. B.
  • Ian Kelly at Jun 10, 2011 at 4:58 pm

    2011/6/10 Sérgio Monteiro Basto <sergiomb at sapo.pt>:
    OK, after thinking about this: this problem exists because Python wants to
    be smart with ttys, which in my point of view is wrong; it should not
    encode to UTF-8 just because the tty is in UTF-8. Python should always
    encode to the same thing. If the default is ASCII, it should always encode
    to ASCII.
    Yes, it should send ASCII to the tty too: if I send my code to a guy on
    Windows whose tty uses cp1000whatever, it shouldn't give encoding errors
    and should send ASCII.
    You can't have your cake and eat it too. If Python needs to output a
    string in ascii, and that string can't be represented in ascii, then
    raising an exception is the only reasonable thing to do. You seem to
    be suggesting that Python should do an implicit output.encode('ascii',
    'replace') on all Unicode output, which might be okay for a TTY, but
    you wouldn't want that for file output; it would allow Python to
    silently create garbage data.

    And what if you send your code to somebody with a UTF-16 terminal?
    You try to output ASCII to that, and you're just going to get complete
    garbage.

    If you want your output to behave that way, then all you have to do is
    specify that with an explicit encode step.
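
    A small sketch of such an explicit encode step in Python 2; the ASCII
    fallback and the 'replace' error handler are assumptions chosen to
    match the behaviour discussed above:

    # -*- coding: utf-8 -*-
    import sys

    u = u'moçambique'

    # Encode explicitly: use whatever encoding stdout reports, fall back to
    # ASCII, and replace anything the target encoding cannot represent.
    encoding = sys.stdout.encoding or 'ascii'
    sys.stdout.write(u.encode(encoding, 'replace') + '\n')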
    If we want, we change the default to whatever we want, but without such a
    "default change" Python should not change its behaviour depending on the
    output. I prefer strange output on a different platform to encoding errors.
    Sorry, I disagree. If your program is going to fail, it's better that
    it fail noisily (with an error) than silently (with no notice that
    anything is wrong).
  • Chris Angelico at Jun 10, 2011 at 10:07 pm

    2011/6/11 Sérgio Monteiro Basto <sergiomb at sapo.pt>:
    OK, after thinking about this: this problem exists because Python wants to
    be smart with ttys
    The *anomaly* (not problem) exists because Python has a way of being
    told a target encoding. If two parties agree on an encoding, they can
    send characters to each other. I had this discussion at work a while
    ago; my boss was talking about being "binary-safe" (which really meant
    "8-bit safe"), while I was saying that we should support, verify, and
    demand properly-formed UTF-8. The main significance is that agreeing
    on an encoding means we can change the encoding any time it's
    convenient, without having to document that we've changed the data -
    because we haven't. I can take the number "twelve thousand three
    hundred and forty-five" and render that as a string of decimal digits
    as "12345", or as hexadecimal digits as "3039", but I haven't changed
    the number. If you know that I'm giving you a string of decimal
    digits, and I give you "12345", you will get the same number at the
    far side.

    Python has agreed with stdout that it will send it characters encoded
    in UTF-8. Having made that agreement, Python and stdout can happily
    communicate in characters, not bytes. You don't need to explicitly
    encode your characters into bytes - and in fact, this would be a very
    bad thing to do, because you don't know _what_ encoding stdout is
    using. If it's expecting UTF-16, you'll get a whole lot of rubbish if
    you send it UTF-8 - but it'll look fine if you send it Unicode.
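
    One can inspect what encoding Python believes it has agreed on with
    stdout (a sketch; the values in the comment are typical, not
    guaranteed):

    import sys
    print sys.stdout.encoding   # e.g. 'UTF-8' on a UTF-8 terminal, None when redirected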

    Chris Angelico
  • Sérgio Monteiro Basto at Jun 13, 2011 at 2:15 pm

    Ian Kelly wrote:

    If you want your output to behave that way, then all you have to do is
    specify that with an explicit encode step.
    ok
    If we want, we change the default to whatever we want, but without such a
    "default change" Python should not change its behaviour depending on the
    output. I prefer strange output on a different platform to encoding
    errors.
    Sorry, I disagree. If your program is going to fail, it's better that
    it fail noisily (with an error) than silently (with no notice that
    anything is wrong).
    Hi,
    OK, a little summary: I got the solution, which is setting the environment
    variable PYTHONIOENCODING=utf-8; if that were the default on modern
    GNU/Linux it would have saved me lots of time.
    My practical problem is simple: I write a script that I want to run in a
    shell for testing and to log to a file when used with a configuration.
    Everything runs well in a shell and sometimes (later) fails when logging
    to a file, with "UnicodeEncodeError: 'ascii' codec can't encode character
    u'\xe7' in position".
    So, to work in both cases (tty and files), I filled all my code with
    .encode('utf-8') calls as a workaround, when what I always wanted was to
    use PYTHONIOENCODING=utf-8. I have everything in UTF-8: the database is in
    UTF-8, I code in UTF-8, my OS is in UTF-8. In about the last 3 years of
    learning Python I have lost many, many hours understanding this problem.
    And see, I can send ASCII and UTF-8 to a UTF-8 output and never have
    problems, but if I send ASCII and UTF-8 to ASCII files I sometimes get
    encode errors.
    So please consider, at least on Linux, defaulting the encoding to UTF-8
    (because we have fewer problems), or make it clearer that piping to a file
    is different from a tty and that the problem is files defaulting to ASCII.
    Or make the default of PYTHONIOENCODING based on the LANG environment
    variable.

    Anyway, many thanks for your time and for helping me out.
    I don't know how things run in Python 3; are the defaults UTF-8 in
    Python 3?

    Thanks,
    --
    Sérgio M. B.
  • Chris Angelico at Jun 13, 2011 at 2:49 pm

    2011/6/14 Sérgio Monteiro Basto <sergiomb at sapo.pt>:
    And see, I can send ASCII and UTF-8 to a UTF-8 output and never have
    problems, but if I send ASCII and UTF-8 to ASCII files I sometimes get
    encode errors.
    If something fits inside 7-bit ASCII, it is by definition valid UTF-8.
    This is not a coincidence.
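
    A one-line illustration of that property in Python 2 (sketch):

    # Pure ASCII bytes decode to the same text with either codec.
    assert 'hello'.decode('ascii') == 'hello'.decode('utf-8')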

    Those hours you've spent grokking this are not wasted, if you now have
    a comprehension of characters vs encodings. More people in the world
    need to understand that difference! :)

    Chris Angelico
  • Mark Tolonen at Jun 10, 2011 at 12:57 am
    "S?rgio Monteiro Basto" <sergiomb at sapo.pt> wrote in message
    news:4df137a7$0$30580$a729d347 at news.telepac.pt...
    How do I change sys.stdout.encoding to always be UTF-8, or at least have a
    consistent sys.stdout.encoding?
    There is an environment variable that can force Python I/O to use a specific
    encoding:

    PYTHONIOENCODING=utf-8
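
    With the original test.py, usage would look something like this (a
    reconstructed shell session, not taken from the thread):

    $ PYTHONIOENCODING=utf-8 ./test.py > output.txt
    $ cat output.txt
    moçambique
    moçambique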

    -Mark
  • Sérgio Monteiro Basto at Jun 10, 2011 at 1:17 am

    Mark Tolonen wrote:
    "S?rgio Monteiro Basto" <sergiomb at sapo.pt> wrote in message
    news:4df137a7$0$30580$a729d347 at news.telepac.pt...
    How do I change sys.stdout.encoding to always be UTF-8, or at least have a
    consistent sys.stdout.encoding?
    There is an environment variable that can force Python I/O to use a specific
    encoding:

    PYTHONIOENCODING=utf-8
    Excellent, thanks. Double thanks.

    BTW: it should be set by default on UTF-8 systems like Fedora, Ubuntu,
    Debian, Red Hat, and all Linuxes. For sure I will put this in the startup
    of my systems.
    -Mark
    --
    Sérgio M. B.
  • Laurent Claessens at Jun 10, 2011 at 5:47 am

    On 09/06/2011 04:18, Sérgio Monteiro Basto wrote:
    hi,
    cat test.py
    #!/usr/bin/env python
    #-*- coding: utf-8 -*-
    u = u'moçambique'
    print u.encode("utf-8")
    print u
    chmod +x test.py
    ./test.py
    moçambique
    moçambique

    The following tries to encode before printing. If you pass an already
    UTF-8-encoded object, it just prints it; if not, it encodes it. All the
    "print" statements go through MyPrint.write.

    #!/usr/bin/env python
    # -*- coding: utf-8 -*-

    import sys

    class MyPrint(object):
        def __init__(self):
            # Remember the real stdout and install this object in its place.
            self.old_stdout = sys.stdout
            sys.stdout = self

        def write(self, text):
            # Unicode text is encoded to UTF-8; byte strings that cannot be
            # re-encoded (encode raises UnicodeDecodeError) pass through as-is.
            try:
                encoded = text.encode("utf8")
            except UnicodeDecodeError:
                encoded = text
            self.old_stdout.write(encoded)


    MyPrint()

    u = u'moçambique'
    print u.encode("utf-8")
    print u

    Test:

    $ ./test.py
    moçambique
    moçambique

    $ ./test.py > test.txt
    $ cat test.txt
    moçambique
    moçambique


    By the way, my code will not help with error messages. I think the errors
    are printed by sys.stderr.write. So if you want to do
    raise "moçambique"
    you should think about adding stderr handling to the MyPrint class.
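
    A rough sketch of that idea, generalised so the same wrapper can be
    installed on both streams (the class name is just an illustration):

    # -*- coding: utf-8 -*-
    import sys

    class EncodingWrapper(object):
        # Wrap a stream so unicode text is encoded to UTF-8 before writing;
        # byte strings that raise UnicodeDecodeError pass through unchanged.
        def __init__(self, stream):
            self.stream = stream

        def write(self, text):
            try:
                text = text.encode("utf8")
            except UnicodeDecodeError:
                pass
            self.stream.write(text)

        def flush(self):
            self.stream.flush()

    sys.stdout = EncodingWrapper(sys.stdout)
    sys.stderr = EncodingWrapper(sys.stderr)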


    If you know French, I strongly recommend "Comprendre les erreurs
    unicode" by Victor Stinner :
    http://dl.afpy.org/pycon-fr-09/Comprendre_les_erreurs_unicode.pdf

    Have a nice day
    Laurent
