FAQ
Hi,

I would appreciate it if someone could point out what I am doing wrong
here.

Basically, I need to save a string containing non-ASCII characters to
a file encoded in UTF-8.

If I stay in python, everything seems to work fine, but the moment I
try to read the file with another Windows program, everything goes to
hell.

So here's the script unicode2file.py:
===================================================================
# encoding=utf-8
import codecs

f = codecs.open("m.txt",mode="w", encoding="utf8")
a = u"mañana"
print repr(a)
f.write(a)
f.close()

f = codecs.open("m.txt", mode="r", encoding="utf8")
a = f.read()
print repr(a)
f.close()
===================================================================

That gives the expected output, both calls to repr() yield the same
result.
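(In modern Python 3, the same round trip can be sketched without the codecs module at all, since the built-in open() accepts an encoding directly — a minimal equivalent, not part of the original script:)

```python
# A Python 3 version of the script above: open() takes an encoding
# argument directly, so codecs.open() and the u"" prefix are not needed.
with open("m.txt", mode="w", encoding="utf-8") as f:
    f.write("mañana")

with open("m.txt", mode="r", encoding="utf-8") as f:
    print(repr(f.read()))  # 'mañana'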

But now, if I do type m.txt in cmd.exe, I get garbled characters
instead of "ñ".

I then open the file with my editor (Sublime Text), and I see "mañana"
normally. I save (nothing to be saved, really), go back to the DOS
prompt, do type m.txt and get the same garbled characters again.

I then open the file m.txt with notepad, and I see "mañana" normally.
I save (again, no actual modifications), go back to the DOS prompt, do
type m.txt and this time it works! I get "mañana". When notepad opens
the file, the encoding is already UTF-8, so short of a UTF-8 BOM being
added to the file, I don't know what happens when I save the
unmodified file. Also, I would think that the Python script should
save a valid UTF-8 file in the first place...

What's going on here?

Regards,
Guillermo


  • Neil Hodgson at Mar 14, 2010 at 9:05 pm

    Guillermo:

    I then open the file m.txt with notepad, and I see "mañana" normally.
    I save (again, no actual modifications), go back to the DOS prompt, do
    type m.txt and this time it works! I get "mañana". When notepad opens
    the file, the encoding is already UTF-8, so short of a UTF-8 BOM being
    added to the file,
    That is what happens: the file now starts with a BOM \xEF\xBB\xBF, as
    you can see with a hex editor.
    I don't know what happens when I save the
    unmodified file. Also, I would think that the python script should
    save a valid utf-8 file in the first place...
    It's just as valid UTF-8 without a BOM. People have different opinions
    on this, but for compatibility I think it is best to always start UTF-8
    files with a BOM.

    Neil
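    (Python's "utf-8-sig" codec makes this BOM behavior directly visible — a
    small sketch, assuming Python 3:)

```python
# The 'utf-8-sig' codec writes the EF BB BF signature on output
# and strips it on input; plain 'utf-8' does neither.
text = "mañana"

with_bom = text.encode("utf-8-sig")
without_bom = text.encode("utf-8")

print(with_bom[:3].hex())           # efbbbf -- the BOM
print(with_bom[3:] == without_bom)  # True -- same payload after the signature

# Decoding with 'utf-8-sig' removes the BOM again:
print(with_bom.decode("utf-8-sig") == text)  # True
```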
  • Guillermo at Mar 14, 2010 at 9:22 pm

    That is what happens: the file now starts with a BOM \xEF\xBB\xBF, as
    you can see with a hex editor.
    Is this an enforced convention under Windows, then? My head's aching
    after so much pulling at my hair, but I have the feeling that the
    problem only arises when text travels through the dos console...

    Cheers,
    Guillermo
  • Joaquin Abian at Mar 14, 2010 at 9:25 pm

    On 14 mar, 22:22, Guillermo wrote:
    That is what happens: the file now starts with a BOM \xEF\xBB\xBF, as
    you can see with a hex editor.
    Is this an enforced convention under Windows, then? My head's aching
    after so much pulling at my hair, but I have the feeling that the
    problem only arises when text travels through the dos console...

    Cheers,
    Guillermo
    Search for BOM on Wikipedia.
    The article there discusses Notepad's behavior.

    ja
  • Neil Hodgson at Mar 14, 2010 at 9:35 pm

    Guillermo:

    Is this an enforced convention under Windows, then? My head's aching
    after so much pulling at my hair, but I have the feeling that the
    problem only arises when text travels through the dos console...
    The console commonly uses Code Page 437, which is most compatible
    with old DOS programs since it can display line-drawing characters. You
    can change the code page to UTF-8 with
    chcp 65001
    Now, "type m.txt" with the original BOM-less file and it should be
    OK. You may also need to change the console font to one that is Unicode
    compatible like Lucida Console.

    Neil
  • Guillermo at Mar 14, 2010 at 9:53 pm

    The console commonly uses Code Page 437, which is most compatible
    with old DOS programs since it can display line-drawing characters. You
    can change the code page to UTF-8 with
    chcp 65001
    That's another issue in my actual script. A twofold problem, actually:

    1) For me chcp gives 850 and I'm relying on that to decode the bytes I
    get back from the console.

    I suppose this is bound to fail because another Windows installation
    might have a different default codepage.

    2) My script gets output from a Popen call (to execute a Powershell
    script [new Windows shell language] from Python; it does make sense!).
    I suppose changing the Windows codepage for a single Popen call isn't
    straightforward/possible?

    Right now, I only get the desired result if I decode the output from
    Popen as "cp850".
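    (The code-page dependence is easy to demonstrate with a toy byte string —
    a sketch assuming Python 3; in cp850 the byte 0xA4 is "ñ":)

```python
# Bytes as a cp850 console might emit them for "mañana":
raw = b"ma\xa4ana"

print(raw.decode("cp850"))   # mañana -- correct with the matching code page
print(raw.decode("cp1252"))  # ma¤ana -- a different code page, different character
```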
  • Neil Hodgson at Mar 14, 2010 at 10:15 pm

    Guillermo:

    2) My script gets output from a Popen call (to execute a Powershell
    script [new Windows shell language] from Python; it does make sense!).
    I suppose changing the Windows codepage for a single Popen call isn't
    straightforward/possible?
    You could try SetConsoleOutputCP and SetConsoleCP.

    Neil
  • Guillermo at Mar 14, 2010 at 10:21 pm

    2) My script gets output from a Popen call (to execute a Powershell
    script [new Windows shell language] from Python; it does make sense!).
    I suppose changing the Windows codepage for a single Popen call isn't
    straightforward/possible?
    Never mind. I'm able to change Windows' code page to 65001 from within
    the Powershell script and I get back a string encoded in UTF-8 with
    BOM, so problem solved!

    Thanks for the help,
    Guillermo
  • Terry Reedy at Mar 14, 2010 at 9:37 pm

    On 3/14/2010 4:40 PM, Guillermo wrote:
    Hi,

    I would appreciate if someone could point out what am I doing wrong
    here.

    Basically, I need to save a string containing non-ascii characters to
    a file encoded in utf-8.

    If I stay in python, everything seems to work fine, but the moment I
    try to read the file with another Windows program, everything goes to
    hell.

    So here's the script unicode2file.py:
    ===================================================================
    # encoding=utf-8
    import codecs

    f = codecs.open("m.txt",mode="w", encoding="utf8")
    a = u"mañana"
    print repr(a)
    f.write(a)
    f.close()

    f = codecs.open("m.txt", mode="r", encoding="utf8")
    a = f.read()
    print repr(a)
    f.close()
    ===================================================================

    That gives the expected output, both calls to repr() yield the same
    result.

    But now, if I do type m.txt in cmd.exe, I get garbled characters
    instead of "ñ".

    I then open the file with my editor (Sublime Text), and I see "mañana"
    normally. I save (nothing to be saved, really), go back to the DOS
    prompt, do type m.txt and get the same garbled characters again.

    I then open the file m.txt with notepad, and I see "mañana" normally.
    I save (again, no actual modifications), go back to the DOS prompt, do
    type m.txt and this time it works! I get "mañana". When notepad opens
    the file, the encoding is already UTF-8, so short of a UTF-8 BOM being
    There is no such thing as a utf-8 'byte order mark'. The concept is an
    oxymoron.
    added to the file, I don't know what happens when I save the
    unmodified file. Also, I would think that the python script should
    save a valid utf-8 file in the first place...
    Adding the byte that some call a 'utf-8 bom' makes the file an invalid
    utf-8 file. However, I suspect that notepad wrote the file in the system
    encoding, which can encode n with tilde and which cmd.exe does
    understand. If you started with a file with encoded cyrillic, arabic,
    hindi, and chinese characters (for instance), I suspect you would get a
    different result.

    tjr
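    (Terry's point about the system encoding can be illustrated — a sketch
    assuming Python 3, with cp1252 standing in for the common Windows "ANSI"
    code page:)

```python
text = "mañana"

# cp1252, the usual Windows "ANSI" code page, can encode ñ...
print(text.encode("cp1252"))  # b'ma\xf1ana'

# ...but it cannot encode, say, Cyrillic, where only a Unicode
# transformation format such as UTF-8 would round-trip:
try:
    "привет".encode("cp1252")
except UnicodeEncodeError:
    print("cp1252 cannot encode Cyrillic")
```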
  • Mark Tolonen at Mar 15, 2010 at 12:02 am
    "Terry Reedy" <tjreedy at udel.edu> wrote in message
    news:hnjkuo$n16$1 at dough.gmane.org...
    On 3/14/2010 4:40 PM, Guillermo wrote:
    Adding the byte that some call a 'utf-8 bom' makes the file an invalid
    utf-8 file.
    Not true. From http://unicode.org/faq/utf_bom.html:

    Q: When a BOM is used, is it only in 16-bit Unicode text?
    A: No, a BOM can be used as a signature no matter how the Unicode text is
    transformed: UTF-16, UTF-8, UTF-7, etc. The exact bytes comprising the BOM
    will be whatever the Unicode character FEFF is converted into by that
    transformation format. In that form, the BOM serves to indicate both that it
    is a Unicode file, and which of the formats it is in. Examples:
    Bytes         Encoding Form
    00 00 FE FF   UTF-32, big-endian
    FF FE 00 00   UTF-32, little-endian
    FE FF         UTF-16, big-endian
    FF FE         UTF-16, little-endian
    EF BB BF      UTF-8

    -Mark
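    (The table can be reproduced by encoding U+FEFF itself — a sketch
    assuming Python 3.8+ for bytes.hex() with a separator:)

```python
# Encoding U+FEFF under each transformation format yields that
# format's BOM, matching the table above.
for enc in ("utf-32-be", "utf-32-le", "utf-16-be", "utf-16-le", "utf-8"):
    print(enc, "\ufeff".encode(enc).hex(" "))
# utf-32-be 00 00 fe ff
# utf-32-le ff fe 00 00
# utf-16-be fe ff
# utf-16-le ff fe
# utf-8 ef bb bf
```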
  • Alf P. Steinbach at Mar 15, 2010 at 12:37 am
    * Mark Tolonen:
    "Terry Reedy" <tjreedy at udel.edu> wrote in message
    news:hnjkuo$n16$1 at dough.gmane.org...
    On 3/14/2010 4:40 PM, Guillermo wrote:
    Adding the byte that some call a 'utf-8 bom' makes the file an invalid
    utf-8 file.
    Not true. From http://unicode.org/faq/utf_bom.html:

    Q: When a BOM is used, is it only in 16-bit Unicode text?
    A: No, a BOM can be used as a signature no matter how the Unicode text
    is transformed: UTF-16, UTF-8, UTF-7, etc. The exact bytes comprising
    the BOM will be whatever the Unicode character FEFF is converted into by
    that transformation format. In that form, the BOM serves to indicate
    both that it is a Unicode file, and which of the formats it is in.
    Examples:
    Bytes         Encoding Form
    00 00 FE FF   UTF-32, big-endian
    FF FE 00 00   UTF-32, little-endian
    FE FF         UTF-16, big-endian
    FF FE         UTF-16, little-endian
    EF BB BF      UTF-8
    Well, technically true, and Terry was wrong about "There is no such thing as a
    utf-8 'byte order mark'. The concept is an oxymoron.". It's true that as a
    descriptive term "byte order mark" is an oxymoron for UTF-8. But in this
    particular context it's not a descriptive term, and it's not only technically
    allowed, as you point out, but sometimes required.

    However, some tools are unable to process UTF-8 files with BOM.

    The most annoying example is the GCC compiler suite, in particular g++,
    which in its Windows MinGW manifestation insists on UTF-8 source code
    without a BOM, while Microsoft's compiler needs the BOM to recognize the
    file as UTF-8. The only way I found to satisfy both compilers, apart
    from a restriction to ASCII (or perhaps Windows ANSI with wide character
    literals restricted to ASCII, exploiting a bug in g++ that lets it
    handle narrow character literals with non-ASCII chars), is to preprocess
    the source code. But that's not a general solution, since the g++
    preprocessor, via another bug, accepts some constructs (which then
    compile nicely) that the compiler doesn't accept when explicit
    preprocessing isn't used. So it's a mess.


    Cheers,

    - Alf

Discussion Overview
group: python-list
categories: python
posted: Mar 14, '10 at 8:40p
active: Mar 15, '10 at 12:37a
posts: 11
users: 6
website: python.org
