I have a program which, among other things, allows the user to pop up a
tkSimpleDialog.askstring() dialog box and enter an arbitrary command
they would like to run. Tkinter apparently maps this dialog onto a
native Win32 dialog which prompts the user and returns their entry. I
then issue an os.system() call on that string.

All this works fine most of the time. However, if the user makes reference
to a file- or directory name which contains an 8-bit character (which is
legal under Win32), the Windows dialog returns a *unicode* string - it apparently
assumes that 8-bit data is automatically to be made into unicode, rather than
returning a byte string. This makes os.system fail loudly, complaining with
the dreaded: UnicodeError: ASCII encoding error: ordinal not in range(128)

IOW, os.system (and indeed any number of other calls like os.chdir) seem to only
work with real strings - they'll accept byte strings - but not with unicode
strings.

Is there a way to "depromote" a unicode string into a byte string that
the os calls can live with, or am I forced to figure out a low-level
system call (via win32all, I guess) which *should* be able to handle
this case.

Arrrrrrgggghhhhhh...
--
------------------------------------------------------------------------------
Tim Daneliuk
tundra at tundraware.com


  • Martin v. Löwis at Jan 21, 2003 at 11:41 pm

    Tim Daneliuk <tundra at tundraware.com> writes:

    > All this works fine most of the time. However, if the user makes
    > reference to a file- or directory name which contains an 8-bit
    > character (which is legal under Win32), the Windows dialog returns a
    > *unicode* string - it apparently assumes that 8-bit data is
    > automatically to be made into unicode, rather than returning a byte
    > string.

    Indeed, Tkinter always returns Unicode strings. Internally, it doesn't
    even "know" what the 8-bit representation is (it also keeps a UTF-8
    representation, but this is not the "native" 8-bit representation).

    > IOW, os.system (and indeed any number of other calls like os.chdir) seem to only
    > work with real strings - they'll accept byte strings - but not with unicode
    > strings.

    This will change in Python 2.3. In many cases, Python 2.2 will also
    accept Unicode strings in file system API on Windows. For Python 2.3,
    and NT+, all Unicode strings are usable as file names.

    This still does not include os.system, or environment variables.

    > Is there a way to "depromote" a unicode string into a byte string that
    > the os calls can live with, or am I forced to figure out a low-level
    > system call (via win32all, I guess) which *should* be able to handle
    > this case.

    Sure. Just invoke .encode(native encoding) on the Unicode object, to
    obtain a byte string.

    On Windows, using "mbcs" as the native encoding is correct in most
    cases.
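
A minimal sketch of the encode step described above, assuming the command arrives as a Unicode string from the dialog. The helper name `to_native_bytes` and its fallback logic are illustrative only and are not part of Tkinter or the os module. Since "mbcs" exists only on Windows, the sketch falls back to `locale.getpreferredencoding()` elsewhere.

```python
import locale
import sys


def to_native_bytes(text, encoding=None):
    """Encode a Unicode string into a byte string in the native encoding.

    Hypothetical helper: with no explicit encoding it uses "mbcs" (the
    ANSI code page) on Windows and the locale's preferred encoding elsewhere.
    """
    if encoding is None:
        if sys.platform == "win32":
            encoding = "mbcs"
        else:
            encoding = locale.getpreferredencoding() or "ascii"
    return text.encode(encoding)


# A command containing an 8-bit character, encoded explicitly with the
# Western European code page so the result is the same on every platform.
command = u"dir caf\xe9"
encoded = to_native_bytes(command, "cp1252")
print(repr(encoded))
```

In the Python 2.2 setup discussed in this thread, the resulting byte string is what would be handed to os.system() in place of the Unicode string returned by the dialog.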

    Regards,
    Martin
  • Tim Daneliuk at Jan 22, 2003 at 1:00 am

    Martin v. Löwis wrote:
    > Tim Daneliuk <tundra at tundraware.com> writes: <SNIP>
    > This will change in Python 2.3. In many cases, Python 2.2 will also
    > accept Unicode strings in file system API on Windows. For Python 2.3,
    > and NT+, all Unicode strings are usable as file names.
    >
    > This still does not include os.system, or environment variables.

    What is the restriction here that prevents 2.3 from doing things
    the same way with these portions of the OS.

    Incidentally, it seems strange to me that Win32 is inherently
    a unicode environment but os.system (which presumably maps
    to some Win32 API) has trouble with unicode strings...

    >> Is there a way to "depromote" a unicode string into a byte string that
    >> the os calls can live with, or am I forced to figure out a low-level
    >> system call (via win32all, I guess) which *should* be able to handle
    >> this case.
    >
    > Sure. Just invoke .encode(native encoding) on the Unicode object, to
    > obtain a byte string.
    >
    > On Windows, using "mbcs" as the native encoding is correct in most
    > cases.

    Do you happen to have a URL (or better still, a programmatic method)
    whereby I might determine the native encodings for various systems?

    Vielen Danke ;) for your extensive help in these unicode matters over
    the past couple of days...


    --
    ------------------------------------------------------------------------------
    Tim Daneliuk
    tundra at tundraware.com
  • Martin v. Löwis at Jan 22, 2003 at 1:25 am

    Tim Daneliuk <tundra at tundraware.com> writes:

    >> This will change in Python 2.3. In many cases, Python 2.2 will also
    >> accept Unicode strings in file system API on Windows. For Python 2.3,
    >> and NT+, all Unicode strings are usable as file names.
    >> This still does not include os.system, or environment variables.
    >
    > What is the restriction here that prevents 2.3 from doing things
    > the same way with these portions of the OS.

    I have problems parsing this sentence. Is this a question?

    > Incidentally, it seems strange to me that Win32 is inherently
    > a unicode environment but os.system (which presumably maps
    > to some Win32 API) has trouble with unicode strings...

    If you are asking why Python 2.3 won't support Unicode strings in
    os.system: primarily because nobody has contributed code to do so. Do
    you volunteer?

    Looking more closely, you will find that os.system is a wrapper around
    the C library function system(), which does not support
    Unicode¹. Internally (i.e. in the Microsoft C library), system() calls
    the ANSI variants of the Win32 API, which then internally (i.e. in the
    operating system code) call the Unicode variants on NT+.

    Getting rid of these layers of indirection might amount to a complete
    reimplementation of the C library part that does os.system. Add to
    that the difficulties of using the Unicode Win32 API on W9x.

    > Do you happen to have a URL (or better still, a programmatic method)
    > whereby I might determine the native encodings for various systems?

    In Python 2.3, locale.getpreferredencoding() should always return the
    encoding that users are likely to use for text data. It uses:
    - getdefaultlocale()[1] on Windows (and also on the Mac, although
      that isn't implemented for OS X in Python 2.2),
    - locale.nl_langinfo(locale.CODESET) for POSIX systems, provided
      they have both nl_langinfo and CODESET,
    - getdefaultlocale()[1] on POSIX systems if nl_langinfo doesn't work.

    getdefaultlocale, in turn, uses:
    - GetACP() on Windows (printing it as cp%d),
    - GetScriptVariable(script, smScriptLang) on Mac OS 9,
    - CFStringGetSystemEncoding() (potentially followed
      by CFStringConvertEncodingToIANACharSetName()) on Mac OS X²,
    - environment variables on all other systems.
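
The lookup described above can be sketched as follows. This is only an illustration of calling `locale.getpreferredencoding()`. Passing the result through `codecs.lookup()` is an addition of this sketch, used to show the name Python's codec registry assigns to it.

```python
import codecs
import locale

# The encoding users are likely to use for text data on this system.
encoding = locale.getpreferredencoding()

# Normalize the reported name through Python's codec registry.
codec_name = codecs.lookup(encoding).name
print(encoding)
print(codec_name)

# On POSIX systems that provide both, nl_langinfo(CODESET) is the source
# named in the list above; Windows has neither, so guard the call.
if hasattr(locale, "nl_langinfo") and hasattr(locale, "CODESET"):
    print(locale.nl_langinfo(locale.CODESET))
```

The printed values depend on the machine and its locale settings, so no particular output should be expected.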

    Notice that, on Windows, there are *two* native encodings: the ANSI
    code page (what the ANSI Win32 API expects, and which is used in the
    windowing system), and the OEM code page (which the FAT file system
    uses on disk, and the command.com/cmd.exe terminal windows, unless
    chcp is invoked).
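
To illustrate why the distinction matters, here is a small sketch that encodes the same text with two code pages. The choice of cp1252 as the ANSI code page and cp437 as the OEM code page is an assumption that matches a typical US English installation; other installations use other pairs.

```python
text = u"caf\xe9"  # "café"

# Assumed ANSI code page for a Western European / US installation.
ansi_bytes = text.encode("cp1252")

# Assumed OEM code page for a US installation.
oem_bytes = text.encode("cp437")

print(repr(ansi_bytes))
print(repr(oem_bytes))

# Decoding ANSI bytes with the OEM code page does not give back the text.
mismatched = ansi_bytes.decode("cp437")
print(mismatched == text)
```

The accented character maps to a different byte in each code page, so bytes produced for one side are misread by the other.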

    Also notice that, on OS X, the encoding used for file names is
    always UTF-8, regardless of what CFStringGetSystemEncoding() returns.
    This is true at least for the BSD POSIX layer of API calls; higher
    layer API calls may use different encodings.
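
As a rough sketch of the file name case, Python can report which encoding it uses for file names through `sys.getfilesystemencoding()`. That function is not mentioned in this thread and is brought in here only for illustration, and the file name is made up.

```python
import sys

# The encoding Python uses to convert Unicode file names for the OS.
fs_encoding = sys.getfilesystemencoding()
print(fs_encoding)

# A hypothetical file name, encoded as UTF-8 the way the BSD layer of
# OS X expects file names to be encoded.
name = u"r\xe9sum\xe9.txt"
utf8_name = name.encode("utf-8")
print(repr(utf8_name))
```

Each accented character becomes two bytes in UTF-8, which is why such a name differs from its single-byte ANSI code page form.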

    Regards,
    Martin

    ¹ There might be a Microsoft extension _wsystem; I haven't checked.
    ² for Python 2.3 only
  • Tim Daneliuk at Jan 22, 2003 at 3:00 am

    Martin v. Löwis wrote:
    > Tim Daneliuk <tundra at tundraware.com> writes:
    >
    >>> This will change in Python 2.3. In many cases, Python 2.2 will also
    >>> accept Unicode strings in file system API on Windows. For Python 2.3,
    >>> and NT+, all Unicode strings are usable as file names.
    >>> This still does not include os.system, or environment variables.
    >>
    >> What is the restriction here that prevents 2.3 from doing things
    >> the same way with these portions of the OS.
    >
    > I have problems parsing this sentence. Is this a question?

    Sorry - my lousy English is almost as bad as my terrible German ;))

    >> Incidentally, it seems strange to me that Win32 is inherently
    >> a unicode environment but os.system (which presumably maps
    >> to some Win32 API) has trouble with unicode strings...
    >
    > If you are asking why Python 2.3 won't support Unicode strings in
    > os.system: primarily because nobody has contributed code to do so. Do
    > you volunteer?

    Ummm, no - I much prefer writing for Unix or realtime systems. Working
    at this low a level in Win32 gives me nightmares ...

    <SNIP>

    > Notice that, on Windows, there are *two* native encodings: the ANSI
    > code page (what the ANSI Win32 API expects, and which is used in the
    > windowing system), and the OEM code page (which the FAT file system
    > uses on disk, and the command.com/cmd.exe terminal windows, unless
    > chcp is invoked).

    (Clearly) I am not too familiar with this, so I ran the commands
    as you suggest and got ('en_US', 'cp1252') just as you've explained.
    So... where does 'mbcs' come from? That is, why is the translation
    from unicode to bytestring not:

    y = unicode_var.encode("cp1252")

    or conversely

    u = unicode(byte_var, "cp1252")

    Just wondering ...

    --
    ------------------------------------------------------------------------------
    Tim Daneliuk
    tundra at tundraware.com
  • Martin v. Löwis at Jan 22, 2003 at 8:15 am

    Tim Daneliuk <tundra at tundraware.com> writes:

    > (Clearly) I am not too familiar with this, so I ran the commands
    > as you suggest and got ('en_US', 'cp1252') just as you've explained.
    > So... where does 'mbcs' come from? That is, why is the translation
    > from unicode to bytestring not:
    >
    > y = unicode_var.encode("cp1252")
    >
    > or conversely
    >
    > u = unicode(byte_var, "cp1252")

    "mbcs" is a codec which internally does MultiByteToWideChar with
    CP_ACP. I.e. it converts to the "ANSI code page", which is a code page
    alias that depends on the Windows installation. It defaults to some
    Microsoft-provided factory default, which depends on the language:

    code page  region
    1250       Central Europe
    1251       Cyrillic
    1252       Western Europe
    1253       Greek
    1254       Turkish
    1255       Hebrew
    1256       Arabic
    1257       Baltic
    1258       Vietnamese

    874        Thai
    932        Japan
    936        Simplified Chinese
    949        Korea
    950        Traditional Chinese

    On Windows XP, this default can be changed by the Administrator. So
    encoding Unicode strings with cp1252 is only correct on a Western-Europe
    installation of Windows; elsewhere it would give confusing results.

    In addition, the "cp1252" codec is a Python-provided one. Python
    currently does not provide CJK codecs, so you can't use, say "cp932" as
    an encoding name on Windows. However, if cp932 happens to be the ANSI
    code page, then you can use "mbcs" to access that encoding.
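
A short sketch of why a hardcoded "cp1252" is fragile, using a Cyrillic string as an assumed example of text from a non-Western installation. The "mbcs" branch runs only on Windows, where that codec exists, and its result depends on the installation's ANSI code page.

```python
import sys

cyrillic = u"\u0416\u0443\u043a"  # a Cyrillic word, chosen for illustration

# cp1252 has no Cyrillic letters, so the hardcoded encoding fails.
try:
    cyrillic.encode("cp1252")
    hardcoded_ok = True
except UnicodeError:
    hardcoded_ok = False
print(hardcoded_ok)

# The Cyrillic code page from the table above handles the same text.
cp1251_bytes = cyrillic.encode("cp1251")
print(repr(cp1251_bytes))

# On Windows, "mbcs" picks whatever the ANSI code page happens to be;
# characters it cannot represent are replaced rather than raising.
if sys.platform == "win32":
    print(repr(cyrillic.encode("mbcs", "replace")))
```

On an installation whose ANSI code page is 1251, the "mbcs" result would match the cp1251 bytes, which is the behaviour that makes "mbcs" the safer default on Windows.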

    Regards,
    Martin

Discussion Overview
group: python-list
category: python
posted: Jan 21, '03 at 11:20p
active: Jan 22, '03 at 8:15a
posts: 6
users: 2
website: python.org
