Dear all,

I've stumbled over a problem with Windows Locale ID information and
codepages. I'm writing a Python application that parses a CSV file;
each line in this file has the format "LCID;Text1;Text2". Each line can
contain a different locale ID (LCID), and the text fields contain data
that is encoded in some codepage which is associated with this LCID. My
current data file contains the codes 1031 for German and 1033 for
English US (as listed in
http://www.microsoft.com/globaldev/reference/lcid-all.mspx).
Unfortunately, I cannot find out which Codepage (like cp-1252 or
whatever) belongs to which LCID.

My question is: How can I convert this data into something more
reasonable like unicode? Basically, what I want is something like
"Text1;Text2", both fields encoded as UTF-8. Can this be done with
Python? How can I find out which codepage I have to use for 1033 and 1031?

Any help appreciated,
Thomas.

  • Skip at Sep 22, 2008 at 3:35 pm
    Thomas> My question is: How can I convert this data into something more
    Thomas> reasonable like unicode? Basically, what I want is something
    Thomas> like "Text1;Text2", both fields encoded as UTF-8. Can this be
    Thomas> done with Python? How can I find out which codepage I have to
    Thomas> use for 1033 and 1031?

    There are examples at the end of the csv module documentation which show
    how to create Unicode readers and writers. You can extend the UnicodeReader
    class to peek at the LCID field and use the corresponding codepage to
    decode the remainder of the line. (This assumes your CSV files do not
    contain fields with embedded newlines, so that each line read is a new
    record in the file.)

    Skip
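
The per-line approach Skip describes could be sketched as follows for current Python 3, where the csv module already works on text and only the decoding has to happen per line. This is a minimal sketch, not the UnicodeReader class from the documentation: the names `LCID_TO_CODEC`, `read_records` and `convert` are made up for illustration, and the table assumes that both LCIDs from the question map to Windows code page 1252.

```python
import csv
import io

# Assumed mapping for the two LCIDs in the question. Both locales use
# Windows code page 1252 as their ANSI code page.
LCID_TO_CODEC = {
    1031: "cp1252",  # German (Germany)
    1033: "cp1252",  # English (United States)
}


def read_records(binary_stream, default_codec="cp1252"):
    """Yield (text1, text2) per line, decoded with that line's codec."""
    for raw_line in binary_stream:
        raw_line = raw_line.rstrip(b"\r\n")
        if not raw_line:
            continue
        # The LCID is plain ASCII digits, so it can be read before decoding.
        lcid_bytes, _, _ = raw_line.partition(b";")
        lcid = int(lcid_bytes.decode("ascii"))
        line = raw_line.decode(LCID_TO_CODEC.get(lcid, default_codec))
        # Let the csv module split the decoded line so quoting is honoured.
        row = next(csv.reader([line], delimiter=";"))
        yield row[1], row[2]


def convert(binary_in, binary_out):
    """Write "Text1;Text2" lines to binary_out, encoded as UTF-8."""
    text_out = io.TextIOWrapper(binary_out, encoding="utf-8", newline="")
    writer = csv.writer(text_out, delimiter=";")
    for text1, text2 in read_records(binary_in):
        writer.writerow([text1, text2])
    text_out.flush()
    text_out.detach()  # leave binary_out open for the caller
```

Both streams are expected to be opened in binary mode, for example `convert(open("in.csv", "rb"), open("out.csv", "wb"))` with hypothetical file names. As noted above, this only works when no field contains an embedded newline.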
  • Tim Golden at Sep 22, 2008 at 3:59 pm

    Thomas Troeger wrote:
    I've stumbled over a problem with Windows Locale ID information and
    codepages. I'm writing a Python application that parses a CSV file;
    each line in this file has the format "LCID;Text1;Text2". Each line can
    contain a different locale ID (LCID), and the text fields contain data
    that is encoded in some codepage which is associated with this LCID. My
    current data file contains the codes 1031 for German and 1033 for
    English US (as listed in
    http://www.microsoft.com/globaldev/reference/lcid-all.mspx).
    Unfortunately, I cannot find out which Codepage (like cp-1252 or
    whatever) belongs to which LCID.

    My question is: How can I convert this data into something more
    reasonable like unicode? Basically, what I want is something like
    "Text1;Text2", both fields encoded as UTF-8. Can this be done with
    Python? How can I find out which codepage I have to use for 1033 and 1031?

    The GetLocaleInfo API call can look up the codepage for an LCID
    (request the LOCALE_IDEFAULTANSICODEPAGE value):

    http://msdn.microsoft.com/en-us/library/ms776270(VS.85).aspx

    You'll need to use ctypes (or write a C extension) to
    call it. Be aware that if it doesn't succeed you may need
    to fall back on cp 65001 -- utf8.

    TJG
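
A rough ctypes sketch of the lookup Tim suggests is shown below. It assumes the wide-character variant `GetLocaleInfoW` from kernel32 and the `LOCALE_IDEFAULTANSICODEPAGE` constant (0x1004 in winnls.h). The helper names `ansi_codepage` and `codec_for_lcid` are hypothetical, and on non-Windows systems the lookup is skipped so that the UTF-8 fallback is used.

```python
import ctypes
import sys

# LCTYPE constant from winnls.h: the locale's default ANSI code page.
LOCALE_IDEFAULTANSICODEPAGE = 0x1004


def ansi_codepage(lcid):
    """Return the ANSI code page number for lcid, or None if unavailable."""
    if sys.platform != "win32":
        return None  # GetLocaleInfoW exists only on Windows
    buf = ctypes.create_unicode_buffer(8)
    # The call returns the number of characters written, or 0 on failure.
    written = ctypes.windll.kernel32.GetLocaleInfoW(
        lcid, LOCALE_IDEFAULTANSICODEPAGE, buf, len(buf))
    if written == 0 or not buf.value.isdigit():
        return None
    codepage = int(buf.value)
    # A value of 0 means the locale has no ANSI code page (Unicode only).
    return codepage or None


def codec_for_lcid(lcid, fallback="utf-8"):
    """Return a Python codec name for lcid, falling back to UTF-8."""
    codepage = ansi_codepage(lcid)
    return fallback if codepage is None else "cp%d" % codepage
```

On a Windows machine `codec_for_lcid(1031)` would be expected to give a name that can be passed straight to `bytes.decode`. Whether Python ships a codec for every code page Windows can report is not guaranteed, so a lookup failure should be handled by the caller.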

Discussion Overview
group: python-list
category: python
posted: Sep 22, '08 at 2:43p
active: Sep 22, '08 at 3:59p
posts: 3
users: 3
website: python.org