FAQ
Greetings, all!

I would like to add unicode support to my dbf project. The dbf header
has a one-byte field to hold the encoding of the file. For example,
\x03 is code-page 437 MS-DOS.

My google-fu is apparently not up to the task of locating a complete
resource that has a list of the 256 possible values and their
corresponding code pages.

So far I have found this, plus variations:
http://support.microsoft.com/kb/129631

Does anyone know of anything more complete?

~Ethan~


  • John Machin at Oct 22, 2009 at 11:46 pm

    On Oct 23, 7:28 am, Ethan Furman wrote:
    Greetings, all!

    I would like to add unicode support to my dbf project. The dbf header
    has a one-byte field to hold the encoding of the file. For example,
    \x03 is code-page 437 MS-DOS.

    My google-fu is apparently not up to the task of locating a complete
    resource that has a list of the 256 possible values and their
    corresponding code pages.
    What makes you imagine that all 256 possible values are mapped to code
    pages?
    So far I have found this, plus variations: http://support.microsoft.com/kb/129631

    Does anyone know of anything more complete?
    That is for VFP3. Try the VFP9 equivalent.

    dBase 5,5,6,7 use others which are not defined in publicly available
    dBase docs AFAICT. Look for "language driver ID" and "LDID". Secondary
    source: ESRI support site.
  • Ethan Furman at Oct 23, 2009 at 4:03 am

    John Machin wrote:
    On Oct 23, 7:28 am, Ethan Furman wrote:

    Greetings, all!

    I would like to add unicode support to my dbf project. The dbf header
    has a one-byte field to hold the encoding of the file. For example,
    \x03 is code-page 437 MS-DOS.

    My google-fu is apparently not up to the task of locating a complete
    resource that has a list of the 256 possible values and their
    corresponding code pages.
    What makes you imagine that all 256 possible values are mapped to code
    pages?
    I'm just wanting to make sure I have whatever is available, and
    preferably standard. :D

    So far I have found this, plus variations: http://support.microsoft.com/kb/129631

    Does anyone know of anything more complete?
    That is for VFP3. Try the VFP9 equivalent.

    dBase 5,5,6,7 use others which are not defined in publicly available
    dBase docs AFAICT. Look for "language driver ID" and "LDID". Secondary
    source: ESRI support site.
    Well, a couple hours later and still not more than I started with.
    Thanks for trying, though!

    ~Ethan~
  • John Machin at Oct 23, 2009 at 7:42 am

    On Oct 23, 3:03 pm, Ethan Furman wrote:
    John Machin wrote:
    On Oct 23, 7:28 am, Ethan Furman wrote:

    Greetings, all!
    I would like to add unicode support to my dbf project. The dbf header
    has a one-byte field to hold the encoding of the file. For example,
    \x03 is code-page 437 MS-DOS.
    My google-fu is apparently not up to the task of locating a complete
    resource that has a list of the 256 possible values and their
    corresponding code pages.
    What makes you imagine that all 256 possible values are mapped to code
    pages?
    I'm just wanting to make sure I have whatever is available, and
    preferably standard. :D
    So far I have found this, plus variations: http://support.microsoft.com/kb/129631
    Does anyone know of anything more complete?
    That is for VFP3. Try the VFP9 equivalent.
    dBase 5,5,6,7 use others which are not defined in publicly available
    dBase docs AFAICT. Look for "language driver ID" and "LDID". Secondary
    source: ESRI support site.
    Well, a couple hours later and still not more than I started with.
    Thanks for trying, though!
    Huh? You got tips to (1) the VFP9 docs (2) the ESRI site (3) search
    keywords and you couldn't come up with anything??
  • Ethan Furman at Oct 23, 2009 at 5:14 pm

    John Machin wrote:
    On Oct 23, 3:03 pm, Ethan Furman wrote:

    John Machin wrote:
    On Oct 23, 7:28 am, Ethan Furman wrote:

    Greetings, all!
    I would like to add unicode support to my dbf project. The dbf header
    has a one-byte field to hold the encoding of the file. For example,
    \x03 is code-page 437 MS-DOS.
    My google-fu is apparently not up to the task of locating a complete
    resource that has a list of the 256 possible values and their
    corresponding code pages.
    What makes you imagine that all 256 possible values are mapped to code
    pages?
    I'm just wanting to make sure I have whatever is available, and
    preferably standard. :D

    So far I have found this, plus variations: http://support.microsoft.com/kb/129631
    Does anyone know of anything more complete?
    That is for VFP3. Try the VFP9 equivalent.
    dBase 5,5,6,7 use others which are not defined in publicly available
    dBase docs AFAICT. Look for "language driver ID" and "LDID". Secondary
    source: ESRI support site.
    Well, a couple hours later and still not more than I started with.
    Thanks for trying, though!

    Huh? You got tips to (1) the VFP9 docs (2) the ESRI site (3) search
    keywords and you couldn't come up with anything??
    Perhaps "nothing new" would have been a better description. I'd already
    seen the clicketyclick site (good info there), and all I found at ESRI
    were folks trying to figure it out, plus one link to a list that was no
    different from the vfp3 list (or was it that the list did not give the
    hex values? Either way, of no use to me.)

    I looked at dbase.com, but came up empty-handed there (not surprising,
    since they are a commercial company).

    I searched some more on Microsoft's site in the VFP9 section, and was
    able to find the code page section this time. Sadly, it only added
    about seven codes.

    At any rate, here is what I have come up with so far. Any corrections
    and/or additions greatly appreciated.

    code_pages = {
        '\x01' : ('ascii', 'U.S. MS-DOS'),
        '\x02' : ('cp850', 'International MS-DOS'),
        '\x03' : ('cp1252', 'Windows ANSI'),
        '\x04' : ('mac_roman', 'Standard Macintosh'),
        '\x64' : ('cp852', 'Eastern European MS-DOS'),
        '\x65' : ('cp866', 'Russian MS-DOS'),
        '\x66' : ('cp865', 'Nordic MS-DOS'),
        '\x67' : ('cp861', 'Icelandic MS-DOS'),
        '\x68' : ('cp895', 'Kamenicky (Czech) MS-DOS'),  # iffy
        '\x69' : ('cp852', 'Mazovia (Polish) MS-DOS'),   # iffy
        '\x6a' : ('cp737', 'Greek MS-DOS (437G)'),
        '\x6b' : ('cp857', 'Turkish MS-DOS'),
        '\x78' : ('big5', 'Traditional Chinese (Hong Kong SAR, Taiwan) Windows'),  # wag
        '\x79' : ('iso2022_kr', 'Korean Windows'),  # wag
        '\x7a' : ('iso2022_jp_2', 'Chinese Simplified (PRC, Singapore) Windows'),  # wag
        '\x7b' : ('iso2022_jp', 'Japanese Windows'),  # wag
        '\x7c' : ('cp874', 'Thai Windows'),  # wag
        '\x7d' : ('cp1255', 'Hebrew Windows'),
        '\x7e' : ('cp1256', 'Arabic Windows'),
        '\xc8' : ('cp1250', 'Eastern European Windows'),
        '\xc9' : ('cp1251', 'Russian Windows'),
        '\xca' : ('cp1254', 'Turkish Windows'),
        '\xcb' : ('cp1253', 'Greek Windows'),
        '\x96' : ('mac_cyrillic', 'Russian Macintosh'),
        '\x97' : ('mac_latin2', 'Macintosh EE'),
        '\x98' : ('mac_greek', 'Greek Macintosh') }

    ~Ethan~
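
    [Editorial sketch, not part of the original thread: one way a table like the
    one above could drive decoding. The trimmed-down dict and the name
    `decode_field` are illustrative only; the '\x01' entry uses cp437, per John's
    correction further down the thread.]

    ```python
    # Illustrative sketch: decode the raw bytes of a DBF character field
    # using the encoding implied by the table's LDID byte.
    code_pages = {
        b'\x01': ('cp437', 'U.S. MS-DOS'),
        b'\x03': ('cp1252', 'Windows ANSI'),
        b'\x65': ('cp866', 'Russian MS-DOS'),
    }

    def decode_field(raw, ldid):
        """Decode a space-padded character field using the LDID's encoding."""
        try:
            encoding, _description = code_pages[ldid]
        except KeyError:
            raise ValueError('unsupported language driver ID: %r' % (ldid,))
        # DBF character fields are padded with trailing spaces.
        return raw.decode(encoding).rstrip()

    print(decode_field(b'Hello   ', b'\x01'))  # Hello
    ```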
  • John Machin at Oct 24, 2009 at 10:58 am

    On Oct 24, 4:14 am, Ethan Furman wrote:
    John Machin wrote:
    On Oct 23, 3:03 pm, Ethan Furman wrote:

    John Machin wrote:
    On Oct 23, 7:28 am, Ethan Furman wrote:

    Greetings, all!
    I would like to add unicode support to my dbf project. The dbf header
    has a one-byte field to hold the encoding of the file. For example,
    \x03 is code-page 437 MS-DOS.
    My google-fu is apparently not up to the task of locating a complete
    resource that has a list of the 256 possible values and their
    corresponding code pages.
    What makes you imagine that all 256 possible values are mapped to code
    pages?
    I'm just wanting to make sure I have whatever is available, and
    preferably standard. :D
    So far I have found this, plus variations: http://support.microsoft.com/kb/129631
    Does anyone know of anything more complete?
    That is for VFP3. Try the VFP9 equivalent.
    dBase 5,5,6,7 use others which are not defined in publicly available
    dBase docs AFAICT. Look for "language driver ID" and "LDID". Secondary
    source: ESRI support site.
    Well, a couple hours later and still not more than I started with.
    Thanks for trying, though!
    Huh? You got tips to (1) the VFP9 docs (2) the ESRI site (3) search
    keywords and you couldn't come up with anything??
    Perhaps "nothing new" would have been a better description. I'd already
    seen the clicketyclick site (good info there)
    Do you think so? My take is that it leaves out most of the codepage
    numbers, and these two lines are wrong:
    65h Nordic MS-DOS code page 865
    66h Russian MS-DOS code page 866

    and all I found at ESRI
    were folks trying to figure it out, plus one link to a list that was no
    different from the vfp3 list (or was it that the list did not give the
    hex values? Either way, of no use to me.)
    Try this:
    http://webhelp.esri.com/arcpad/8.0/referenceguide/

    I looked at dbase.com, but came up empty-handed there (not surprising,
    since they are a commercial company).
    MS and ESRI have docs ... does that mean that they are non-commercial
    companies?
    I searched some more on Microsoft's site in the VFP9 section, and was
    able to find the code page section this time. Sadly, it only added
    about seven codes.

    At any rate, here is what I have come up with so far. Any corrections
    and/or additions greatly appreciated.

    code_pages = {
    '\x01' : ('ascii', 'U.S. MS-DOS'),
    All of the sources say codepage 437, so why ascii instead of cp437?
    '\x02' : ('cp850', 'International MS-DOS'),
    '\x03' : ('cp1252', 'Windows ANSI'),
    '\x04' : ('mac_roman', 'Standard Macintosh'),
    '\x64' : ('cp852', 'Eastern European MS-DOS'),
    '\x65' : ('cp866', 'Russian MS-DOS'),
    '\x66' : ('cp865', 'Nordic MS-DOS'),
    '\x67' : ('cp861', 'Icelandic MS-DOS'),
    '\x68' : ('cp895', 'Kamenicky (Czech) MS-DOS'), # iffy
    Indeed iffy. Python doesn't have a cp895 encoding, and it's probably
    not alone. I suggest that you omit Kamenicky until someone actually
    wants it.
    '\x69' : ('cp852', 'Mazovia (Polish) MS-DOS'), # iffy
    Look 5 lines back. cp852 is 'Eastern European MS-DOS'. Mazovia
    predates and is not the same as cp852. In any case, I suggest that you
    omit Mazovia until someone wants it. Interesting reading:

    http://www.jastra.com.pl/klub/ogonki.htm
    '\x6a' : ('cp737', 'Greek MS-DOS (437G)'),
    '\x6b' : ('cp857', 'Turkish MS-DOS'),
    '\x78' : ('big5', 'Traditional Chinese (Hong Kong SAR, Taiwan)\
    big5 is *not* the same as cp950. The products that create DBF files
    were designed for Windows. So when your source says that LDID 0xXX
    maps to Windows codepage YYY, I would suggest that all you should do
    is translate that without thinking to python encoding cpYYY.
    Windows'), # wag
    What does "wag" mean?
    '\x79' : ('iso2022_kr', 'Korean Windows'), # wag
    Try cp949.

    '\x7a' : ('iso2022_jp_2', 'Chinese Simplified (PRC, Singapore)\
    Windows'), # wag
    Very wrong. iso2022_jp_2 is supposed to include basic Japanese, basic
    (1980) Chinese (GB2312) and a basic Korean kit. However to quote from
    "CJKV Information Processing" by Ken Lunde, "... from a practical
    point of view, ISO-2022-JP-2 ..... [is] equivalent to ISO-2022-JP-1
    encoding." i.e. no Chinese support at all. Try cp936.
    '\x7b' : ('iso2022_jp', 'Japanese Windows'), # wag
    Try cp936.
    '\x7c' : ('cp874', 'Thai Windows'), # wag
    '\x7d' : ('cp1255', 'Hebrew Windows'),
    '\x7e' : ('cp1256', 'Arabic Windows'),
    '\xc8' : ('cp1250', 'Eastern European Windows'),
    '\xc9' : ('cp1251', 'Russian Windows'),
    '\xca' : ('cp1254', 'Turkish Windows'),
    '\xcb' : ('cp1253', 'Greek Windows'),
    '\x96' : ('mac_cyrillic', 'Russian Macintosh'),
    '\x97' : ('mac_latin2', 'Macintosh EE'),
    '\x98' : ('mac_greek', 'Greek Macintosh') }
    HTH,
    John
  • Ethan Furman at Oct 26, 2009 at 4:22 pm

    John Machin wrote:
    On Oct 24, 4:14 am, Ethan Furman wrote:

    John Machin wrote:
    On Oct 23, 3:03 pm, Ethan Furman wrote:

    John Machin wrote:
    On Oct 23, 7:28 am, Ethan Furman wrote:

    Greetings, all!
    I would like to add unicode support to my dbf project. The dbf header
    has a one-byte field to hold the encoding of the file. For example,
    \x03 is code-page 437 MS-DOS.
    My google-fu is apparently not up to the task of locating a complete
    resource that has a list of the 256 possible values and their
    corresponding code pages.
    What makes you imagine that all 256 possible values are mapped to code
    pages?
    I'm just wanting to make sure I have whatever is available, and
    preferably standard. :D
    So far I have found this, plus variations: http://support.microsoft.com/kb/129631
    Does anyone know of anything more complete?
    That is for VFP3. Try the VFP9 equivalent.
    dBase 5,5,6,7 use others which are not defined in publicly available
    dBase docs AFAICT. Look for "language driver ID" and "LDID". Secondary
    source: ESRI support site.
    Well, a couple hours later and still not more than I started with.
    Thanks for trying, though!
    Huh? You got tips to (1) the VFP9 docs (2) the ESRI site (3) search
    keywords and you couldn't come up with anything??
    Perhaps "nothing new" would have been a better description. I'd already
    seen the clicketyclick site (good info there)

    Do you think so? My take is that it leaves out most of the codepage
    numbers, and these two lines are wrong:
    65h Nordic MS-DOS code page 865
    66h Russian MS-DOS code page 866
    That was the site I used to get my whole project going, so ignoring the
    unicode aspect, it has been very helpful to me.

    and all I found at ESRI
    were folks trying to figure it out, plus one link to a list that was no
    different from the vfp3 list (or was it that the list did not give the
    hex values? Either way, of no use to me.)

    Try this:
    http://webhelp.esri.com/arcpad/8.0/referenceguide/
    Wow. Question, though: all those codepages mapping to 437 and 850 --
    are they really all the same?

    I looked at dbase.com, but came up empty-handed there (not surprising,
    since they are a commercial company).

    MS and ESRI have docs ... does that mean that they are non-commercial
    companies?
    I don't know enough about ESRI to make an informed comment, so I'll just
    say I'm grateful they have them! MS is a complete mystery... perhaps
    they are finally seeing the light? Hard to believe, though, from a
    company that has consistently changed their file formats with every release.

    I searched some more on Microsoft's site in the VFP9 section, and was
    able to find the code page section this time. Sadly, it only added
    about seven codes.

    At any rate, here is what I have come up with so far. Any corrections
    and/or additions greatly appreciated.

    code_pages = {
    '\x01' : ('ascii', 'U.S. MS-DOS'),

    All of the sources say codepage 437, so why ascii instead of cp437?
    Hard to say, really. Adjusted.

    '\x02' : ('cp850', 'International MS-DOS'),
    '\x03' : ('cp1252', 'Windows ANSI'),
    '\x04' : ('mac_roman', 'Standard Macintosh'),
    '\x64' : ('cp852', 'Eastern European MS-DOS'),
    '\x65' : ('cp866', 'Russian MS-DOS'),
    '\x66' : ('cp865', 'Nordic MS-DOS'),
    '\x67' : ('cp861', 'Icelandic MS-DOS'),
    '\x68' : ('cp895', 'Kamenicky (Czech) MS-DOS'), # iffy

    Indeed iffy. Python doesn't have a cp895 encoding, and it's probably
    not alone. I suggest that you omit Kamenicky until someone actually
    wants it.
    Yeah, I noticed that. Tentative plan was to implement it myself (more
    for practice than anything else), and also to be able to raise a more
    specific error ("Kamenicky not currently supported" or some such).

    '\x69' : ('cp852', 'Mazovia (Polish) MS-DOS'), # iffy

    Look 5 lines back. cp852 is 'Eastern European MS-DOS'. Mazovia
    predates and is not the same as cp852. In any case, I suggest that you
    omit Mazovia until someone wants it. Interesting reading:

    http://www.jastra.com.pl/klub/ogonki.htm
    Very interesting reading.

    '\x6a' : ('cp737', 'Greek MS-DOS (437G)'),
    '\x6b' : ('cp857', 'Turkish MS-DOS'),
    '\x78' : ('big5', 'Traditional Chinese (Hong Kong SAR, Taiwan)\

    big5 is *not* the same as cp950. The products that create DBF files
    were designed for Windows. So when your source says that LDID 0xXX
    maps to Windows codepage YYY, I would suggest that all you should do
    is translate that without thinking to python encoding cpYYY.
    Ack. Not sure how I missed 'Windows' at the end of that description.

    Windows'), # wag
    What does "wag" mean?
    wag == 'wild ass guess'

    '\x79' : ('iso2022_kr', 'Korean Windows'), # wag
    Try cp949.
    Done.

    '\x7a' : ('iso2022_jp_2', 'Chinese Simplified (PRC, Singapore)\
    Windows'), # wag

    Very wrong. iso2022_jp_2 is supposed to include basic Japanese, basic
    (1980) Chinese (GB2312) and a basic Korean kit. However to quote from
    "CJKV Information Processing" by Ken Lunde, "... from a practical
    point of view, ISO-2022-JP-2 ..... [is] equivalent to ISO-2022-JP-1
    encoding." i.e. no Chinese support at all. Try cp936.
    Done.

    '\x7b' : ('iso2022_jp', 'Japanese Windows'), # wag

    Try cp936.
    You mean 932?

    '\x7c' : ('cp874', 'Thai Windows'), # wag
    '\x7d' : ('cp1255', 'Hebrew Windows'),
    '\x7e' : ('cp1256', 'Arabic Windows'),
    '\xc8' : ('cp1250', 'Eastern European Windows'),
    '\xc9' : ('cp1251', 'Russian Windows'),
    '\xca' : ('cp1254', 'Turkish Windows'),
    '\xcb' : ('cp1253', 'Greek Windows'),
    '\x96' : ('mac_cyrillic', 'Russian Macintosh'),
    '\x97' : ('mac_latin2', 'Macintosh EE'),
    '\x98' : ('mac_greek', 'Greek Macintosh') }

    HTH,
    John

    Very helpful indeed. Many thanks for reviewing and correcting.
    Learning to deal with unicode is proving more difficult for me than
    learning Python was to begin with! ;D

    ~Ethan~
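
    [Editorial sketch, not part of the original thread: a quick way to check
    which of the codec names discussed in this exchange the Python standard
    library actually ships is `codecs.lookup`, which raises LookupError for a
    missing codec.]

    ```python
    # Sketch: confirm Python has codecs for the corrected names
    # (cp437, cp950, cp949, cp936, cp932), and that cp895 (Kamenicky)
    # really is absent, as John notes.
    import codecs

    for name in ('cp437', 'cp950', 'cp949', 'cp936', 'cp932'):
        codecs.lookup(name)  # raises LookupError if the codec is absent

    try:
        codecs.lookup('cp895')
    except LookupError:
        print('cp895 (Kamenicky) is not in the stdlib')
    ```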
  • John Machin at Oct 26, 2009 at 7:21 pm

    On Oct 27, 3:22 am, Ethan Furman wrote:
    John Machin wrote:
    On Oct 24, 4:14 am, Ethan Furman wrote:

    John Machin wrote:
    On Oct 23, 3:03 pm, Ethan Furman wrote:

    John Machin wrote:
    On Oct 23, 7:28 am, Ethan Furman wrote:
    Try this:
    http://webhelp.esri.com/arcpad/8.0/referenceguide/
    Wow. Question, though: all those codepages mapping to 437 and 850 --
    are they really all the same?
    437 and 850 *are* codepages. You mean "all those language driver IDs
    mapping to codepages 437 and 850". A codepage merely gives an
    encoding. An LDID is like a locale; it includes other things besides
    the encoding. That's why many Western European languages map to the
    same codepage, first 437 then later 850 then 1252 when Windows came
    along.
    '\x68' : ('cp895', 'Kamenicky (Czech) MS-DOS'), # iffy
    Indeed iffy. Python doesn't have a cp895 encoding, and it's probably
    not alone. I suggest that you omit Kamenicky until someone actually
    wants it.
    Yeah, I noticed that. Tentative plan was to implement it myself (more
    for practice than anything else), and also to be able to raise a more
    specific error ("Kamenicky not currently supported" or some such).
    The error idea is fine, but I don't get the "implement it yourself for
    practice" bit ... practice what? You plan a long and fruitful career
    implementing codecs for YAGNI codepages?
    '\x7b' : ('iso2022_jp', 'Japanese Windows'), # wag
    Try cp936.
    You mean 932?
    Yes.
    Very helpful indeed. Many thanks for reviewing and correcting.
    You're welcome.
    Learning to deal with unicode is proving more difficult for me than
    learning Python was to begin with! ;D
    ?? As far as I can tell, the topic has been about mapping from
    something like a locale to the name of an encoding, i.e. all about the
    pre-Unicode mishmash and nothing to do with dealing with unicode ...

    BTW, what are you planning to do with an LDID of 0x00?

    Cheers,

    John
  • Ethan Furman at Oct 26, 2009 at 8:15 pm

    John Machin wrote:
    On Oct 27, 3:22 am, Ethan Furman wrote:

    John Machin wrote:
    Wow. Question, though: all those codepages mapping to 437 and 850 --
    are they really all the same?
    437 and 850 *are* codepages. You mean "all those language driver IDs
    mapping to codepages 437 and 850". A codepage merely gives an
    encoding. An LDID is like a locale; it includes other things besides
    the encoding. That's why many Western European languages map to the
    same codepage, first 437 then later 850 then 1252 when Windows came
    along.
    Let me rephrase -- say I get a dbf file with an LDID of \x0f that maps
    to a cp437, and the file came from a german oem machine... could that
    file have upper-ascii codes that will not map to anything reasonable on
    my \x01 cp437 machine? If so, is there anything I can do about it?

    '\x68' : ('cp895', 'Kamenicky (Czech) MS-DOS'), # iffy
    Indeed iffy. Python doesn't have a cp895 encoding, and it's probably
    not alone. I suggest that you omit Kamenicky until someone actually
    wants it.
    Yeah, I noticed that. Tentative plan was to implement it myself (more
    for practice than anything else), and also to be able to raise a more
    specific error ("Kamenicky not currently supported" or some such).

    The error idea is fine, but I don't get the "implement it yourself for
    practice" bit ... practice what? You plan a long and fruitful career
    implementing codecs for YAGNI codepages?
    ROFL. Playing with code; the unicode/code page interactions. Possibly
    looking at constructs I might not otherwise. Since this would almost
    certainly (I don't like saying "absolutely" and "never" -- been
    troubleshooting for too many years for that!-) be a YAGNI, implementing
    it is very low priority.

    '\x7b' : ('iso2022_jp', 'Japanese Windows'), # wag
    Try cp936.
    You mean 932?

    Yes.

    Very helpful indeed. Many thanks for reviewing and correcting.

    You're welcome.

    Learning to deal with unicode is proving more difficult for me than
    learning Python was to begin with! ;D

    ?? As far as I can tell, the topic has been about mapping from
    something like a locale to the name of an encoding, i.e. all about the
    pre-Unicode mishmash and nothing to do with dealing with unicode ...
    You are, of course, correct. Once it's all unicode life will be easier
    (he says, all innocent-like). And dbf files even bigger, lol.

    BTW, what are you planning to do with an LDID of 0x00?
    Hmmm. Well, logical choices seem to be either treating it as plain
    ascii, and barfing when high-ascii shows up; defaulting to \x01; or
    forcing the user to choose one on initial access.

    I am definitely open to ideas!

    Cheers,

    John
  • John Machin at Oct 27, 2009 at 12:38 am

    On Oct 27, 7:15 am, Ethan Furman wrote:
    John Machin wrote:
    On Oct 27, 3:22 am, Ethan Furman wrote:

    John Machin wrote:
    Wow. Question, though: all those codepages mapping to 437 and 850 --
    are they really all the same?
    437 and 850 *are* codepages. You mean "all those language driver IDs
    mapping to codepages 437 and 850". A codepage merely gives an
    encoding. An LDID is like a locale; it includes other things besides
    the encoding. That's why many Western European languages map to the
    same codepage, first 437 then later 850 then 1252 when Windows came
    along.
    Let me rephrase -- say I get a dbf file with an LDID of \x0f that maps
    to a cp437, and the file came from a german oem machine... could that
    file have upper-ascii codes that will not map to anything reasonable on
    my \x01 cp437 machine? If so, is there anything I can do about it?
    ASCII is defined over the first 128 codepoints; "upper-ascii codes" is
    meaningless. As for the rest of your question, if the file's encoded
    in cpXXX, it's encoded in cpXXX. If either the creator or the reader
    or both are lying, then all bets are off.
    BTW, what are you planning to do with an LDID of 0x00?
    Hmmm. Well, logical choices seem to be either treating it as plain
    ascii, and barfing when high-ascii shows up; defaulting to \x01; or
    forcing the user to choose one on initial access.
    It would be more useful to allow the user to specify an encoding than
    an LDID.

    You need to be able to read files created not only by software like
    VFP or dBase but also scripts using third-party libraries. It would be
    useful to allow an encoding to override an LDID that is incorrect e.g.
    the LDID implies cp1251 but the data is actually encoded in koi8[ru]

    Read this: http://en.wikipedia.org/wiki/Code_page_437
    With no LDID in the file and no encoding supplied, I'd be inclined to
    make it barf if any codepoint not in range(32, 128) showed up.

    Cheers,
    John
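
    [Editorial sketch, not part of the original thread: John's no-LDID rule
    above, expressed as a small check. `check_plain_ascii` is a made-up name.]

    ```python
    # Sketch of the suggestion above: with no LDID in the file and no
    # encoding supplied, refuse any byte outside printable range(32, 128).
    def check_plain_ascii(raw):
        for offset, byte in enumerate(raw):  # bytes iterate as ints in Python 3
            if byte not in range(32, 128):
                raise ValueError(
                    'byte 0x%02x at offset %d: no LDID or encoding supplied'
                    % (byte, offset))
        return raw.decode('ascii')

    print(check_plain_ascii(b'plain text'))  # plain text
    ```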
  • Ethan Furman at Oct 27, 2009 at 3:51 pm

    John Machin wrote:
    On Oct 27, 7:15 am, Ethan Furman wrote:

    Let me rephrase -- say I get a dbf file with an LDID of \x0f that maps
    to a cp437, and the file came from a german oem machine... could that
    file have upper-ascii codes that will not map to anything reasonable on
    my \x01 cp437 machine? If so, is there anything I can do about it?
    ASCII is defined over the first 128 codepoints; "upper-ascii codes" is
    meaningless. As for the rest of your question, if the file's encoded
    in cpXXX, it's encoded in cpXXX. If either the creator or the reader
    or both are lying, then all bets are off.
    My confusion is this -- is there a difference between any of the various
    cp437s? Going down the list at ESRI: 0x01, 0x09, 0x0b, 0x0d, 0x0f,
    0x11, 0x15, 0x18, 0x19, and 0x1b all map to cp437, and they have names
    such as US, Dutch, Finnish, French, German, Italian, Swedish, Spanish,
    English (Britain & US)... are these all the same?

    BTW, what are you planning to do with an LDID of 0x00?
    Hmmm. Well, logical choices seem to be either treating it as plain
    ascii, and barfing when high-ascii shows up; defaulting to \x01; or
    forcing the user to choose one on initial access.
    It would be more useful to allow the user to specify an encoding than
    an LDID.
    I plan on using the same technique used in xlrd and xlwt, and allowing
    an encoding to be specified when the table is opened. If not specified,
    it will use whatever the table has in the LDID field.

    You need to be able to read files created not only by software like
    VFP or dBase but also scripts using third-party libraries. It would be
    useful to allow an encoding to override an LDID that is incorrect e.g.
    the LDID implies cp1251 but the data is actually encoded in koi8[ru]

    Read this: http://en.wikipedia.org/wiki/Code_page_437
    With no LDID in the file and no encoding supplied, I'd be inclined to
    make it barf if any codepoint not in range(32, 128) showed up.
    Sounds reasonable -- especially when the encoding can be overridden.

    ~Ethan~
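
    [Editorial sketch, not part of the original thread: the xlrd-style
    precedence Ethan describes, where a caller-supplied encoding overrides the
    LDID stored in the table header. `pick_encoding` and the trimmed dict are
    hypothetical names.]

    ```python
    # Sketch of the override precedence discussed above: an explicit
    # encoding argument wins; otherwise fall back to the header's LDID byte.
    code_pages = {b'\x01': 'cp437', b'\x03': 'cp1252', b'\xc9': 'cp1251'}

    def pick_encoding(ldid_byte, encoding=None):
        if encoding is not None:
            return encoding  # e.g. the LDID says cp1251 but data is koi8-r
        try:
            return code_pages[ldid_byte]
        except KeyError:
            raise ValueError('unknown LDID %r and no encoding supplied'
                             % (ldid_byte,))

    print(pick_encoding(b'\xc9', encoding='koi8_r'))  # koi8_r
    print(pick_encoding(b'\x01'))                     # cp437
    ```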
  • John Machin at Oct 28, 2009 at 3:46 am

    On Oct 28, 2:51 am, Ethan Furman wrote:
    John Machin wrote:
    On Oct 27, 7:15 am, Ethan Furman wrote:

    Let me rephrase -- say I get a dbf file with an LDID of \x0f that maps
    to a cp437, and the file came from a german oem machine... could that
    file have upper-ascii codes that will not map to anything reasonable on
    my \x01 cp437 machine? If so, is there anything I can do about it?
    ASCII is defined over the first 128 codepoints; "upper-ascii codes" is
    meaningless. As for the rest of your question, if the file's encoded
    in cpXXX, it's encoded in cpXXX. If either the creator or the reader
    or both are lying, then all bets are off.
    My confusion is this -- is there a difference between any of the various
    cp437s?
    What various cp437s???
    Going down the list at ESRI: 0x01, 0x09, 0x0b, 0x0d, 0x0f,
    0x11, 0x15, 0x18, 0x19, and 0x1b all map to cp437,
    Yes, this is called a "many-to-*one*" relationship.
    and they have names
    "they" being the Language Drivers, not the codepages.
    such as US, Dutch, Finnish, French, German, Italian, Swedish, Spanish,
    English (Britain & US)... are these all the same?
    When you read the Wikipedia page on cp437, did you see any reference
    to different versions for French, German, Finnish, etc? I saw only one
    mapping table; how many did you see? If there are multiple language
    versions of a codepage, how do you expect to handle this given Python
    has only one codec per codepage?

    Trying again: *ONE* attribute of a Language Driver ID (LDID) is the
    character set (codepage) that it uses. Other attributes may be things
    like the collating (sorting) sequence, whether they use a dot or a
    comma as the decimal point, etc. Many different languages in Western
    Europe can use the same codepage. Initially the common one was cp 437,
    then 850, then 1252.

    There may possibly be different interpretations of a codepage out there
    somewhere, but they are all *intended* to be the same, and I advise
    you to cross the different-cp437s bridge *if* it exists and you ever
    come to it.

    Have you got access to files with LDID not in (0, 1) that you can try
    out?

    Cheers,
    John
  • Ethan Furman at Oct 28, 2009 at 4:59 am

    John Machin wrote:
    There may possibly be different interpretations of a codepage out there
    somewhere, but they are all *intended* to be the same, and I advise
    you to cross the different-cp437s bridge *if* it exists and you ever
    come to it.

    Have you got access to files with LDID not in (0, 1) that you can try
    out?
    Alas, I do not. And I probably never will, making the whole thing academic.

    Speaking of tables I do not have access to, and documentation for that
    matter, I would love to get information on db4, 5, 7, etc.

    Many thanks for your time and knowledge, and my apologies for seeming so
    dense. :)

    Cheers!

    ~Ethan~

Discussion Overview
group: python-list @ python
posted: Oct 22, '09 at 8:28p
active: Oct 28, '09 at 4:59a
posts: 13
users: 2
website: python.org

2 users in discussion

Ethan Furman: 7 posts
John Machin: 6 posts
