hi folks,
I am puzzled by unicode generally, and within the context of Python
specifically. For one thing, what do we mean when we say that unicode is
used in Python 3.x by default? (I know what default means; I mean, what
changed?)

I think part of my problem is that I'm spoiled (American, ASCII
heritage) and have been either stuck in ASCII knowingly, or in UTF-8
without knowing it (just because the code points lined up). I am confused
by the implications of using 3.x, because I am reading that there are
significant things to be aware of... what?

On my 2.6 installation sys.maxunicode comes up with 1114111, and my
2.7 and 3.2 installs come up with 65535 each. So, I am assuming that 2.6
was compiled with the UCS-4 (UTF-32) option for 4-byte unicode(?) and that
the default compile option for 2.7 & 3.2 (I didn't change anything) is
set for UCS-2 (UTF-16) or 2-byte unicode(?). Do I understand this much
correctly?

The books say that the .py sources are UTF-8 by default... and that
3.x is either UCS-2 or UCS-4. If I use the file handling capabilities
of Python in 3.x (by default) what encoding will be used, and how will
that affect the output?

If I do not specify any code points above ascii 0xFF does any of
this matter anyway?



Thanks.

kind regards,
m harris


  • Ian Kelly at May 11, 2011 at 10:09 pm

    On Wed, May 11, 2011 at 3:37 PM, harrismh777 wrote:
    hi folks,
    I am puzzled by unicode generally, and within the context of Python
    specifically. For one thing, what do we mean when we say that unicode is
    used in Python 3.x by default? (I know what default means; I mean, what
    changed?)
    The `unicode' class was renamed to `str', and a stripped-down version
    of the 2.X `str' class was renamed to `bytes'.
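    For example, a quick Python 3 session showing the new names (a minimal
    sketch; any 3.x will do):

    >>> type('abc')     # text; this type was called `unicode' in 2.x
    <class 'str'>
    >>> type(b'abc')    # raw bytes; roughly the old 2.x `str'
    <class 'bytes'>
    >>> 'abc' == b'abc' # and the two no longer compare equal
    False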
    I think part of my problem is that I'm spoiled (American, ASCII heritage)
    and have been either stuck in ASCII knowingly, or in UTF-8 without knowing
    it (just because the code points lined up). I am confused by the implications
    of using 3.x, because I am reading that there are significant things to be
    aware of... what?
    Mainly, Python 3 no longer does implicit conversion between bytes and
    unicode, requiring the programmer to be explicit about such
    conversions. If you have Python 2 code that is sloppy about this, you
    may get some Unicode encode/decode errors when trying to run the same
    code in Python 3. The 2to3 tool can help somewhat with this, but it
    can't prevent all problems.
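    For instance, every conversion now has to be spelled out (a minimal
    sketch; exact error wording varies by version):

    >>> '\u00a3'.encode('utf-8')     # str -> bytes, explicit
    b'\xc2\xa3'
    >>> b'\xc2\xa3'.decode('utf-8')  # bytes -> str, explicit
    '£'
    >>> b'\xc2\xa3' + '\u00a3'       # no implicit coercion any more
    Traceback (most recent call last):
    ...
    TypeError: can't concat bytes to str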
    On my 2.6 installation sys.maxunicode comes up with 1114111, and my 2.7
    and 3.2 installs come up with 65535 each. So, I am assuming that 2.6 was
    compiled with the UCS-4 (UTF-32) option for 4-byte unicode(?) and that the
    default compile option for 2.7 & 3.2 (I didn't change anything) is set for
    UCS-2 (UTF-16) or 2-byte unicode(?). Do I understand this much correctly?
    I think that UCS-2 has always been the default unicode width for
    CPython, although the exact representation used internally is an
    implementation detail.
    The books say that the .py sources are UTF-8 by default... and that 3.x is
    either UCS-2 or UCS-4. If I use the file handling capabilities of Python in
    3.x (by default) what encoding will be used, and how will that affect the
    output?
    If you open a file in binary mode, the result is a non-decoded byte stream.

    If you open a file in text mode and do not specify an encoding, then
    the result of locale.getpreferredencoding() is used for decoding, and
    the result is a unicode stream.
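    For example (file name arbitrary; the encoding argument shown is just
    one choice):

    >>> f = open('data.txt', 'rb')  # binary: read() returns bytes
    >>> f = open('data.txt', 'r')   # text: decoded via locale.getpreferredencoding()
    >>> f = open('data.txt', 'r', encoding='utf-8')  # text, explicit encoding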
    If I do not specify any code points above ascii 0xFF does any of this
    matter anyway?
    You mean 0x7F, and probably, due to the need to explicitly encode and decode.
  • Benjamin Kaplan at May 11, 2011 at 10:34 pm

    On Wed, May 11, 2011 at 2:37 PM, harrismh777 wrote:
    hi folks,
    I am puzzled by unicode generally, and within the context of Python
    specifically. For one thing, what do we mean when we say that unicode is
    used in Python 3.x by default? (I know what default means; I mean, what
    changed?)

    I think part of my problem is that I'm spoiled (American, ASCII heritage)
    and have been either stuck in ASCII knowingly, or in UTF-8 without knowing
    it (just because the code points lined up). I am confused by the implications
    of using 3.x, because I am reading that there are significant things to be
    aware of... what?

    On my 2.6 installation sys.maxunicode comes up with 1114111, and my 2.7
    and 3.2 installs come up with 65535 each. So, I am assuming that 2.6 was
    compiled with the UCS-4 (UTF-32) option for 4-byte unicode(?) and that the
    default compile option for 2.7 & 3.2 (I didn't change anything) is set for
    UCS-2 (UTF-16) or 2-byte unicode(?). Do I understand this much correctly?
    Not really sure about that, but it doesn't matter anyway. Because even
    though internally the string is stored as either a UCS-2 or a UCS-4
    string, you never see that. You just see this string as a sequence of
    characters. If you want to turn it into a sequence of bytes, you have
    to use an encoding.
    The books say that the .py sources are UTF-8 by default... and that 3.x is
    either UCS-2 or UCS-4. If I use the file handling capabilities of Python in
    3.x (by default) what encoding will be used, and how will that affect the
    output?

    If I do not specify any code points above ascii 0xFF does any of this
    matter anyway?
    ASCII only goes up to 0x7F. If you were using UTF-8 bytestrings, then
    there is a difference for anything over that range. A byte string is a
    sequence of bytes. A unicode string is a sequence of these mythical
    abstractions called characters. So a unicode string u'\u00a0' will
    have a length of 1. Encode that to UTF-8 and you'll find it has a
    length of 2 (because UTF-8 uses two or more bytes to encode everything
    above 0x7F - the top bit is used to signal that you need the next byte
    for this character).
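    Interactively (Python 3 syntax, where all string literals are unicode):

    >>> s = '\u00a0'             # NO-BREAK SPACE, a single character
    >>> len(s)
    1
    >>> len(s.encode('utf-8'))   # but two bytes once encoded
    2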

    If you want the history behind the whole encoding mess, Joel Spolsky
    wrote a rather amusing article explaining how this all came about:
    http://www.joelonsoftware.com/articles/Unicode.html

    And the biggest reason to use Unicode is so that you don't have to
    worry about your program messing up because someone hands you input in
    a different encoding than you used.
  • Harrismh777 at May 11, 2011 at 10:51 pm
    Ian Kelly wrote:

    Ian, Benjamin, thanks much.
    The `unicode' class was renamed to `str', and a stripped-down version
    of the 2.X `str' class was renamed to `bytes'.
    ... thank you, this is very helpful.
    If I do not specify any code points above ascii 0xFF does any of this
    matter anyway?
    You mean 0x7F, and probably, due to the need to explicitly encode and decode.
    Yes, actually, I did... and from Benjamin's reply it seems that
    this matters only if I am working with bytes. Is it true that if I am
    working without using byte sequences I will not need to care about
    the encoding anyway, unless of course I need to specify a unicode code
    point?

    Thanks again.

    kind regards,
    m harris
  • John Machin at May 11, 2011 at 11:32 pm

    On Thu, May 12, 2011 8:51 am, harrismh777 wrote:
    Is it true that if I am
    working without using byte sequences I will not need to care about
    the encoding anyway, unless of course I need to specify a unicode code
    point?
    Quite the contrary.

    (1) You cannot work without using bytes sequences. Files are byte
    sequences. Web communication is in bytes. You need to (know / assume / be
    able to extract / guess) the input encoding. You need to encode your
    output using an encoding that is expected by the consumer (or use an
    output method that will do it for you).

    (2) You don't need to use bytes to specify a Unicode code point. Just use
    an escape sequence e.g. "\u0404" is a Cyrillic character.
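    For example:

    >>> s = "\u0404"        # CYRILLIC CAPITAL LETTER UKRAINIAN IE
    >>> len(s)              # one character; no bytes involved yet
    1
    >>> s.encode('utf-8')   # bytes appear only when you encode
    b'\xd0\x84'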
  • Harrismh777 at May 12, 2011 at 1:22 am

    John Machin wrote:
    (1) You cannot work without using bytes sequences. Files are byte
    sequences. Web communication is in bytes. You need to (know / assume / be
    able to extract / guess) the input encoding. You need to encode your
    output using an encoding that is expected by the consumer (or use an
    output method that will do it for you).

    (2) You don't need to use bytes to specify a Unicode code point. Just use
    an escape sequence e.g. "\u0404" is a Cyrillic character.
    Thanks John. In reverse order, I understand point (2). I'm less clear
    on point (1).

    If I generate a string of characters that I presume to be ascii/utf-8
    (no \u0404 type characters) and write them to a file (stdout), how does
    the default encoding affect that file... by default? I'm not seeing that
    there is anything unusual going on... If I open the file with vi? If
    I open the file with gedit? emacs?

    ....

    Another question... in mail I'm receiving many small blocks that look
    like sprites with four small hex codes, scattered about the mail...
    mostly punctuation, maybe? ... guessing, are these unicode code
    points, and if so what is the best way to 'guess' the encoding? ... is
    it coded in the stream somewhere...protocol?

    thanks
  • MRAB at May 12, 2011 at 2:31 am

    On 12/05/2011 02:22, harrismh777 wrote:
    John Machin wrote:
    (1) You cannot work without using bytes sequences. Files are byte
    sequences. Web communication is in bytes. You need to (know / assume / be
    able to extract / guess) the input encoding. You need to encode your
    output using an encoding that is expected by the consumer (or use an
    output method that will do it for you).

    (2) You don't need to use bytes to specify a Unicode code point. Just use
    an escape sequence e.g. "\u0404" is a Cyrillic character.
    Thanks John. In reverse order, I understand point (2). I'm less clear on
    point (1).

    If I generate a string of characters that I presume to be ascii/utf-8
    (no \u0404 type characters) and write them to a file (stdout), how does
    the default encoding affect that file... by default? I'm not seeing that
    there is anything unusual going on... If I open the file with vi? If I
    open the file with gedit? emacs?

    ....

    Another question... in mail I'm receiving many small blocks that look
    like sprites with four small hex codes, scattered about the mail...
    mostly punctuation, maybe? ... guessing, are these unicode code points,
    and if so what is the best way to 'guess' the encoding? ... is it coded
    in the stream somewhere...protocol?
    You need to understand the difference between characters and bytes.

    A string contains characters, a file contains bytes.

    The encoding specifies how a character is represented as bytes.

    For example:

    In the Latin-1 encoding, the character "£" is represented by the
    byte 0xA3.

    In the UTF-8 encoding, the character "£" is represented by the byte
    sequence 0xC2 0xA3.

    In the ASCII encoding, the character "£" can't be represented at all.

    The advantage of UTF-8 is that it can represent _all_ Unicode
    characters (codepoints, actually) as byte sequences, and all those in
    the ASCII range are represented by the same single bytes which the
    original ASCII system used. Use the UTF-8 encoding unless you have to
    use a different one.
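    All three cases, at the interactive prompt ('\u00a3' is the pound sign):

    >>> '\u00a3'.encode('latin-1')
    b'\xa3'
    >>> '\u00a3'.encode('utf-8')
    b'\xc2\xa3'
    >>> '\u00a3'.encode('ascii')
    Traceback (most recent call last):
    ...
    UnicodeEncodeError: 'ascii' codec can't encode character '\xa3' in
    position 0: ordinal not in range(128)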

    A file contains only bytes, a socket handles only bytes. Which encoding
    you should use for characters is down to protocol. A system such as
    email, which can handle different encodings, should have a way of
    specifying the encoding, and perhaps also a default encoding.
  • Steven D'Aprano at May 12, 2011 at 3:16 am

    On Thu, 12 May 2011 03:31:18 +0100, MRAB wrote:

    Another question... in mail I'm receiving many small blocks that look
    like sprites with four small hex codes, scattered about the mail...
    mostly punctuation, maybe? ... guessing, are these unicode code points,
    and if so what is the best way to 'guess' the encoding? ... is it coded
    in the stream somewhere...protocol?
    You need to understand the difference between characters and bytes.

    http://www.joelonsoftware.com/articles/Unicode.html

    is also a good resource.


    --
    Steven
  • Harrismh777 at May 12, 2011 at 3:44 am

    Steven D'Aprano wrote:
    You need to understand the difference between characters and bytes.
    http://www.joelonsoftware.com/articles/Unicode.html

    is also a good resource.
    Thanks for being patient guys, here's what I've done:
    >>> astr = "pound sign"
    >>> asym = " \u00A3"
    >>> afile = open("myfile", mode='w')
    >>> afile.write(astr + asym)
    12
    >>> afile.close()

    When I edit "myfile" with vi I see the 'characters' :

    pound sign £

    ... same with emacs, same with gedit ...


    When I hexdump myfile I see this:

    0000000 6f70 6375 2064 6973 6e67 c220 00a3


    This is *not* what I expected... well it is (little-endian) right up to
    the 'c2' and that is what is confusing me....

    I did not open the file with an encoding of UTF-8... so I'm assuming
    UTF-16 by default (python3) so I was expecting a '00A3' little-endian as
    'A300' but what I got instead was UTF-8 little-endian 'c2a3' ....

    See my problem?... when I open the file with emacs I see the character
    pound sign... same with gedit... they're all using UTF-8 by default. By
    default it looks like Python3 is writing output with UTF-8 as default...
    and I thought that by default Python3 was using either UTF-16 or UTF-32.
    So, I'm confused here... also, I used the character sequence \u00A3
    which I thought was UTF-16... but Python3 changed my intent to 'c2a3'
    which is the normal UTF-8...

    Thanks again for your patience... I really do hate to be dense about
    this... but this is another area where I'm just beginning to dabble and
    I'd like to know soon what I'm doing...

    Thanks for the link Steve... I'm headed there now...




    kind regards,
    m harris
  • Terry Reedy at May 12, 2011 at 4:12 am

    On 5/11/2011 11:44 PM, harrismh777 wrote:
    Steven D'Aprano wrote:
    You need to understand the difference between characters and bytes.
    http://www.joelonsoftware.com/articles/Unicode.html

    is also a good resource.
    Thanks for being patient guys, here's what I've done:
    >>> astr = "pound sign"
    >>> asym = " \u00A3"
    >>> afile = open("myfile", mode='w')
    >>> afile.write(astr + asym)
    12
    >>> afile.close()

    When I edit "myfile" with vi I see the 'characters' :

    pound sign £

    ... same with emacs, same with gedit ...


    When I hexdump myfile I see this:

    0000000 6f70 6375 2064 6973 6e67 c220 00a3
    This is *not* what I expected... well it is (little-endian) right up to
    the 'c2' and that is what is confusing me....
    I did not open the file with an encoding of UTF-8... so I'm assuming
    UTF-16 by default (python3) so I was expecting a '00A3' little-endian as
    'A300' but what I got instead was UTF-8 little-endian 'c2a3' ....

    See my problem?... when I open the file with emacs I see the character
    pound sign... same with gedit... they're all using UTF-8 by default. By
    default it looks like Python3 is writing output with UTF-8 as default...
    and I thought that by default Python3 was using either UTF-16 or UTF-32.
    So, I'm confused here... also, I used the character sequence \u00A3
    which I thought was UTF-16... but Python3 changed my intent to 'c2a3'
    which is the normal UTF-8...
    If you open a file as binary (bytes), you must write bytes, and they are
    stored without transformation. If you open in text mode, you must write
    text (string as unicode in 3.2) and Python will encode to bytes using
    either some default or the encoding you specified in the open statement.
    It does not matter how Python stored the unicode internally. Does this
    help? Your intent is signalled by how you open the file.
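    A sketch of the two intents (file names arbitrary; exact error text
    varies by version):

    >>> open('out.txt', 'w', encoding='utf-8').write('\u00a3')  # text mode encodes for you
    1
    >>> open('out.bin', 'wb').write('\u00a3'.encode('utf-8'))   # binary mode: you encode
    2
    >>> open('out.bin', 'wb').write('\u00a3')                   # binary mode rejects str
    Traceback (most recent call last):
    ...
    TypeError: 'str' does not support the buffer interface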

    --
    Terry Jan Reedy
  • John Machin at May 12, 2011 at 4:14 am

    On Thu, May 12, 2011 1:44 pm, harrismh777 wrote:
    By
    default it looks like Python3 is writing output with UTF-8 as default...
    and I thought that by default Python3 was using either UTF-16 or UTF-32.
    So, I'm confused here... also, I used the character sequence \u00A3
    which I thought was UTF-16... but Python3 changed my intent to 'c2a3'
    which is the normal UTF-8...
    Python uses either a 16-bit or a 32-bit INTERNAL representation of Unicode
    code points. Those NN bits have nothing to do with the UTF-NN encodings,
    which can be used to encode the codepoints as byte sequences for EXTERNAL
    purposes. In your case, UTF-8 has been used as it is the default encoding
    on your platform.
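    For instance, the byte layout you expected appears only when you
    explicitly ask for an external UTF-16 encoding:

    >>> '\u00a3'.encode('utf-16-le')  # A3 00, little-endian, as you predicted
    b'\xa3\x00'
    >>> '\u00a3'.encode('utf-8')      # what your locale's default produced
    b'\xc2\xa3'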
  • Benjamin Kaplan at May 12, 2011 at 4:14 am

    On Wed, May 11, 2011 at 8:44 PM, harrismh777 wrote:
    Steven D'Aprano wrote:
    You need to understand the difference between characters and bytes.
    http://www.joelonsoftware.com/articles/Unicode.html

    is also a good resource.
    Thanks for being patient guys, here's what I've done:
    >>> astr = "pound sign"
    >>> asym = " \u00A3"
    >>> afile = open("myfile", mode='w')
    >>> afile.write(astr + asym)
    12
    >>> afile.close()

    When I edit "myfile" with vi I see the 'characters' :

    pound sign £

    ... same with emacs, same with gedit ...


    When I hexdump myfile I see this:

    0000000 6f70 6375 2064 6973 6e67 c220 00a3


    This is *not* what I expected... well it is (little-endian) right up to the
    'c2' and that is what is confusing me....

    I did not open the file with an encoding of UTF-8... so I'm assuming UTF-16
    by default (python3) so I was expecting a '00A3' little-endian as 'A300' but
    what I got instead was UTF-8 little-endian 'c2a3' ....
    quick note here: UTF-8 doesn't have an endian-ness. It's always read
    from left to right, with the high bit telling you whether you need to
    continue or not. So it's always "little endian".
    See my problem?... when I open the file with emacs I see the character pound
    sign... same with gedit... they're all using UTF-8 by default. By default it
    looks like Python3 is writing output with UTF-8 as default... and I thought
    that by default Python3 was using either UTF-16 or UTF-32. So, I'm confused
    here... also, I used the character sequence \u00A3 which I thought was
    UTF-16... but Python3 changed my intent to 'c2a3' which is the normal
    UTF-8...
    The fact that CPython uses UCS-2 or UCS-4 internally is an
    implementation detail and isn't actually part of the Python
    specification. As far as a Python program is concerned, a Unicode
    string is a list of character objects, not bytes. Much like any other
    object, a unicode character needs to be serialized before it can be
    written to a file. An encoding is a serialization function for
    characters.

    If the file you're writing to doesn't specify an encoding, Python will
    default to locale.getdefaultencoding(), which tries to get your
    system's preferred encoding from environment variables (in other
    words, the same source that emacs and gedit will use to get the
    default encoding).
  • John Machin at May 12, 2011 at 4:41 am

    On Thu, May 12, 2011 2:14 pm, Benjamin Kaplan wrote:
    If the file you're writing to doesn't specify an encoding, Python will
    default to locale.getdefaultencoding(),
    No such attribute. Perhaps you mean locale.getpreferredencoding()
  • Harrismh777 at May 12, 2011 at 6:14 am

    John Machin wrote:
    On Thu, May 12, 2011 2:14 pm, Benjamin Kaplan wrote:

    If the file you're writing to doesn't specify an encoding, Python will
    default to locale.getdefaultencoding(),
    No such attribute. Perhaps you mean locale.getpreferredencoding()
    >>> import locale
    >>> locale.getpreferredencoding()
    'UTF-8'

    Yessssssss!


    :)
  • TheSaint at May 12, 2011 at 12:40 pm

    John Machin wrote:
    On Thu, May 12, 2011 2:14 pm, Benjamin Kaplan wrote:

    If the file you're writing to doesn't specify an encoding, Python will
    default to locale.getdefaultencoding(),
    No such attribute. Perhaps you mean locale.getpreferredencoding()
    What about sys.getfilesystemencoding()?
    If I distribute a program, how can I guess which encoding the user
    will have?


    --
    goto /dev/null
  • Harrismh777 at May 12, 2011 at 6:43 am

    Terry Reedy wrote:
    It does not matter how Python stored the unicode internally. Does this
    help? Your intent is signalled by how you open the file.
    Very much, actually, thanks. I was missing the 'internal' piece, and
    did not realize that if I didn't specify the encoding on the open that
    python would pull the default encoding from locale...


    kind regards,
    m harris
  • John Machin at May 12, 2011 at 3:54 am

    On Thu, May 12, 2011 11:22 am, harrismh777 wrote:
    John Machin wrote:
    (1) You cannot work without using bytes sequences. Files are byte
    sequences. Web communication is in bytes. You need to (know / assume / be
    able to extract / guess) the input encoding. You need to encode your
    output using an encoding that is expected by the consumer (or use an
    output method that will do it for you).

    (2) You don't need to use bytes to specify a Unicode code point. Just use
    an escape sequence e.g. "\u0404" is a Cyrillic character.
    Thanks John. In reverse order, I understand point (2). I'm less clear
    on point (1).

    If I generate a string of characters that I presume to be ascii/utf-8
    (no \u0404 type characters) and write them to a file (stdout), how does
    the default encoding affect that file... by default? I'm not seeing that
    there is anything unusual going on...
    About """characters that I presume to be ascii/utf-8 (no \u0404 type
    characters)""": All Unicode characters (including U+0404) are encodable in
    bytes using UTF-8.

    The result of sys.stdout.write(unicode_characters) to a TERMINAL depends
    mostly on sys.stdout.encoding. This is likely to be UTF-8 on a
    linux/OSX platform. On a typical American / Western European / [former]
    colonies Windows box, this is likely to be cp850 in a Command Prompt
    window, and cp1252 in IDLE.

    UTF-8: All Unicode characters are encodable in UTF-8. Only problem arises
    if the terminal can't render the character -- you'll get spaces or blobs
    or boxes with hex digits in them or nothing.

    Windows (Command Prompt window): only a small subset of characters can be
    encoded in e.g. cp850; anything else causes an exception.

    Windows (IDLE): ignores sys.stdout.encoding and renders the characters
    itself. Same outcome as *x/UTF-8 above.

    If you write directly (or sys.stdout is redirected) to a FILE, the default
    encoding is obtained by sys.getdefaultencoding() and is AFAIK ascii unless
    the machine's site.py has been fiddled with to make it UTF-8 or something
    else.
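    You can check what your own setup will use (output shown is from a
    UTF-8 Linux box; values vary by platform and locale):

    >>> import sys, locale
    >>> sys.stdout.encoding
    'UTF-8'
    >>> locale.getpreferredencoding()
    'UTF-8'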
    If I open the file with vi? If
    I open the file with gedit? emacs?
    Any editor will have a default encoding; if that doesn't match the file
    encoding, you have a (hopefully obvious) problem if the editor doesn't
    detect the mismatch. Consult your editor's docs or HTFF1K.
    Another question... in mail I'm receiving many small blocks that look
    like sprites with four small hex codes, scattered about the mail...
    mostly punctuation, maybe? ... guessing, are these unicode code points,
    Yes.
    and if so what is the best way to 'guess' the encoding?
    google("chardet") or rummage through the mail headers (but 4 hex digits in
    a box are a symptom of inability to render, not necessarily caused by an
    incorrect decoding)

    ... is
    it coded in the stream somewhere...protocol?
    Should be.
  • Ben Finney at May 12, 2011 at 4:07 am

    MRAB <python at mrabarnett.plus.com> writes:

    You need to understand the difference between characters and bytes.
    Yep. Those who don't, need to join us in the third millennium, and the
    resources pointed out in this thread are good to help with that.
    A string contains characters, a file contains bytes.
    That's not true for Python 2.

    I'd phrase that as:

    * Text is a sequence of characters. Most inputs to the program,
    including files, sockets, etc., contain a sequence of bytes.

    * Always know whether you're dealing with text or with bytes. No object
    can be both.

    * In Python 2, ‘str’ is the type for a sequence of bytes. ‘unicode’ is
    the type for text.

    * In Python 3, ‘str’ is the type for text. ‘bytes’ is the type for a
    sequence of bytes.

    --
    \ “I went to a garage sale. ‘How much for the garage?’ ‘It's not |
    `\ for sale.’” —Steven Wright |
    _o__) |
    Ben Finney
  • Harrismh777 at May 12, 2011 at 6:31 am

    Ben Finney wrote:
    I'd phrase that as:
    * Text is a sequence of characters. Most inputs to the program,
    including files, sockets, etc., contain a sequence of bytes.
    * Always know whether you're dealing with text or with bytes. No object
    can be both.
    * In Python 2, ‘str’ is the type for a sequence of bytes. ‘unicode’ is
    the type for text.
    * In Python 3, ‘str’ is the type for text. ‘bytes’ is the type for a
    sequence of bytes.

    That is very helpful... thanks


    MRAB, Steve, John, Terry, Ben F, Ben K, Ian...
    ...thank you guys so much, I think I've got a better picture now of
    what is going on... this is also one place where I don't think the books
    are as clear as they need to be at least for me...(Lutz, Summerfield).

    So, the UTF-16 UTF-32 is INTERNAL only, for Python... and text in/out is
    based on locale... in my case UTF-8 ...that is enormously helpful for
    me... understanding locale on this system is as mystifying as unicode is
    in the first place.
    Well, after reading about unicode tonight (about four hours) I realize
    that it's not really that hard... there's just a lot of details that have
    to come together. Straightening out that whole tower-of-babel thing is
    sure a pain in the butt.
    I also was not aware that UTF-8 chars could be up to six (6) bytes long
    from left to right. I see now that the little-endianness I was
    ascribing to python is just a function of hexdump... and I was a little
    disappointed to find that hexdump does not support UTF-8, just ascii... doh.
    Anyway, thanks again... I've got enough now to play around a bit...

    PS thanks Steve for that link, informative and entertaining too... Joel
    says, "If you are a programmer . . . and you don't know the basics of
    characters, character sets, encodings, and Unicode, and I catch you, I'm
    going to punish you by making you peel onions for 6 months in a
    submarine. I swear I will". :)

    kind regards,
    m harris
  • John Machin at May 12, 2011 at 7:58 am

    On Thu, May 12, 2011 4:31 pm, harrismh777 wrote:
    So, the UTF-16 UTF-32 is INTERNAL only, for Python
    NO. See one of my previous messages. UTF-16 and UTF-32, like UTF-8 are
    encodings for the EXTERNAL representation of Unicode characters in byte
    streams.
    I also was not aware that UTF-8 chars could be up to six(6) byes long
    from left to right.
    It could be, once upon a time in ISO faerieland, when it was thought that
    Unicode could grow to 2**32 codepoints. However ISO and the Unicode
    consortium have agreed that 17 planes is the utter max, and accordingly a
    valid UTF-8 byte sequence can be no longer than 4 bytes ... see below
    >>> chr(17 * 65536)
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    ValueError: chr() arg not in range(0x110000)
    >>> chr(17 * 65536 - 1)
    '\U0010ffff'
    >>> _.encode('utf8')
    b'\xf4\x8f\xbf\xbf'
    >>> b'\xf5\x8f\xbf\xbf'.decode('utf8')
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
      File "C:\python32\lib\encodings\utf_8.py", line 16, in decode
        return codecs.utf_8_decode(input, errors, True)
    UnicodeDecodeError: 'utf8' codec can't decode byte 0xf5 in position 0:
    invalid start byte
  • Ian Kelly at May 12, 2011 at 4:17 pm

    On Thu, May 12, 2011 at 1:58 AM, John Machin wrote:
    On Thu, May 12, 2011 4:31 pm, harrismh777 wrote:


    So, the UTF-16 UTF-32 is INTERNAL only, for Python
    NO. See one of my previous messages. UTF-16 and UTF-32, like UTF-8 are
    encodings for the EXTERNAL representation of Unicode characters in byte
    streams.
    Right. *Under the hood* Python uses UCS-2 (which is not exactly the
    same thing as UTF-16, by the way) to represent Unicode strings.
    However, this is entirely transparent. To the Python programmer, a
    unicode string is just an abstraction of a sequence of code-points.
    You don't need to think about UCS-2 at all. The only times you need
    to worry about encodings are when you're encoding unicode characters
    to byte strings, or decoding bytes to unicode characters, or opening a
    stream in text mode; and in those cases the only encoding that matters
    is the external one.
  • Terry Reedy at May 12, 2011 at 8:42 pm

    On 5/12/2011 12:17 PM, Ian Kelly wrote:
    On Thu, May 12, 2011 at 1:58 AM, John Machinwrote:
    On Thu, May 12, 2011 4:31 pm, harrismh777 wrote:


    So, the UTF-16 UTF-32 is INTERNAL only, for Python
    NO. See one of my previous messages. UTF-16 and UTF-32, like UTF-8 are
    encodings for the EXTERNAL representation of Unicode characters in byte
    streams.
    Right. *Under the hood* Python uses UCS-2 (which is not exactly the
    same thing as UTF-16, by the way) to represent Unicode strings.
    I know some people say that, but according to the definitions of the
    unicode consortium, that is wrong! The earlier UCS-2 *cannot* represent
    chars in the Supplementary Planes. The later (1996) UTF-16, which Python
    uses, can. The standard considers 'UCS-2' obsolete long ago. See

    https://secure.wikimedia.org/wikipedia/en/wiki/UTF-16/UCS-2
    or http://www.unicode.org/faq/basic_q.html#14

    The latter says: "Q: What is the difference between UCS-2 and UTF-16?
    A: UCS-2 is obsolete terminology which refers to a Unicode
    implementation up to Unicode 1.1, before surrogate code points and
    UTF-16 were added to Version 2.0 of the standard. This term should now
    be avoided."

    It goes on: "Sometimes in the past an implementation has been labeled
    "UCS-2" to indicate that it does not support supplementary characters
    and doesn't interpret pairs of surrogate code points as characters. Such
    an implementation would not handle processing of character properties,
    code point boundaries, collation, etc. for supplementary characters."

    I know that 16-bit Python *does* use surrogate pairs for supplementary
    chars and at least some properties work for them. I am not sure exactly
    what the rest means.
    However, this is entirely transparent. To the Python programmer, a
    unicode string is just an abstraction of a sequence of code-points.
    You don't need to think about UCS-2 at all. The only times you need
    to worry about encodings are when you're encoding unicode characters
    to byte strings, or decoding bytes to unicode characters, or opening a
    stream in text mode; and in those cases the only encoding that matters
    is the external one.
    If one uses unicode chars in the Supplementary Planes above the BMP (the
    first 2**16), which require surrogate pairs for 16 bit unicode (UTF-16),
    then the abstraction leaks.
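    For example, on a narrow (16-bit) build such as the OP's 3.2 (a
    wide/UCS-4 build would report a length of 1):

    >>> s = '\U00010123'   # a supplementary-plane character
    >>> len(s)             # narrow build: stored as a surrogate pair
    2
    >>> s[0], s[1]         # the surrogates leak through indexing
    ('\ud800', '\udd23')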

    --
    Terry Jan Reedy
  • Ian Kelly at May 12, 2011 at 10:25 pm

    On Thu, May 12, 2011 at 2:42 PM, Terry Reedy wrote:
    On 5/12/2011 12:17 PM, Ian Kelly wrote:
    Right. *Under the hood* Python uses UCS-2 (which is not exactly the
    same thing as UTF-16, by the way) to represent Unicode strings.
    I know some people say that, but according to the definitions of the unicode
    consortium, that is wrong! The earlier UCS-2 *cannot* represent chars in the
    Supplementary Planes. The later (1996) UTF-16, which Python uses, can. The
    standard considers 'UCS-2' obsolete long ago. See

    https://secure.wikimedia.org/wikipedia/en/wiki/UTF-16/UCS-2
    or http://www.unicode.org/faq/basic_q.html#14
    At the first link, in the section _Use in major operating systems and
    environments_ it states, "The Python language environment officially
    only uses UCS-2 internally since version 2.1, but the UTF-8 decoder to
    "Unicode" produces correct UTF-16. Python can be compiled to use UCS-4
    (UTF-32) but this is commonly only done on Unix systems."

    PEP 100 says:

    The internal format for Unicode objects should use a Python
    specific fixed format <PythonUnicode> implemented as 'unsigned
    short' (or another unsigned numeric type having 16 bits). Byte
    order is platform dependent.

    This format will hold UTF-16 encodings of the corresponding
    Unicode ordinals. The Python Unicode implementation will address
    these values as if they were UCS-2 values. UCS-2 and UTF-16 are
    the same for all currently defined Unicode character points.
    UTF-16 without surrogates provides access to about 64k characters
    and covers all characters in the Basic Multilingual Plane (BMP) of
    Unicode.

    It is the Codec's responsibility to ensure that the data they pass
    to the Unicode object constructor respects this assumption. The
    constructor does not check the data for Unicode compliance or use
    of surrogates.

    I'm getting out of my depth here, but that implies to me that while
    Python stores UTF-16 and can correctly encode/decode it to UTF-8,
    other codecs might only work correctly with UCS-2, and the unicode
    class itself ignores surrogate pairs.

    Although I'm not sure how much this might have changed since the
    original implementation, especially for Python 3.
  • Jmfauth at May 13, 2011 at 6:28 am

    On 12 May, 18:17, Ian Kelly wrote:

    ...
    to worry about encodings are when you're encoding unicode characters
    to byte strings, or decoding bytes to unicode characters

    A small but important correction/clarification:

    In Unicode, "unicode" does not encode a *character*. It
    encodes a *code point*: a number, the integer associated
    with the character.
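    In Python terms (any 3.x), ord() and chr() convert between a
    one-character string and that integer:

    >>> ord('\u0404')   # the code point, as a plain integer
    1028
    >>> chr(1028)       # and back to the one-character string
    'Є'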

    jmf
  • Harrismh777 at May 13, 2011 at 7:53 pm

    jmfauth wrote:
    to worry about encodings are when you're encoding unicode characters
    to byte strings, or decoding bytes to unicode characters
    A small but important correction/clarification:

    In Unicode, "unicode" does not encode a *character*. It
    encodes a *code point*: a number, the integer associated
    with the character.
    That is a huge code-point... pun intended.

    ... and there is another point that I continue to be somewhat puzzled
    about, and that is the issue of fonts.

    On of my hobbies at the moment is ancient Greek (biblical studies,
    Septuaginta LXX, and Greek New Testament). I have these texts on my
    computer in a folder in several formats... pdf, unicode 'plaintext',
    osis.xml, and XML.

    These texts may be found at http://sblgnt.com

    I am interested for the moment only in the 'plaintext' stream,
    because it is unicode. (First, in unicode, according to all the docs,
    there is no such thing as 'plaintext', so keep that in mind.)

    When I open the text stream in one of my unicode editors I can see
    'most' of the characters in a rudimentary Greek font with accents;
    however, I also see many tiny square blocks indicating (I think) that
    the code points do *not* have a corresponding character in my unicode
    font for that Greek symbol (whatever it is supposed to be).

    The point, or question is, how does one go about making sure that
    there is a corresponding font glyph to match a specific unicode code
    point for display in a particular terminal (editor, browser, whatever) ?

    The unicode consortium is very careful to make sure that thousands
    of symbols have a unique code point (that's great !) but how do these
    thousands of symbols actually get displayed if there is no font
    consortium? Are there collections of 'standard' fonts for unicode that
    I am not aware of? Is there a unix linux package that can be installed
    that drops at least 'one' default standard font that will be able to
    render all or 'most' (whatever I mean by that) code points in unicode?
    Is this a Python issue at all?


    kind regards,
    m harris
  • Robert Kern at May 13, 2011 at 8:18 pm

    On 5/13/11 2:53 PM, harrismh777 wrote:

    The unicode consortium is very careful to make sure that thousands of symbols
    have a unique code point (that's great !) but how do these thousands of symbols
    actually get displayed if there is no font consortium? Are there collections of
    'standard' fonts for unicode that I am not aware of?
    There are some well-known fonts that try to cover a large section of the Unicode
    standard.

    http://en.wikipedia.org/wiki/Unicode_typeface
    Is there a unix linux package
    that can be installed that drops at least 'one' default standard font that will
    be able to render all or 'most' (whatever I mean by that) code points in
    unicode? Is this a Python issue at all?
    Not really.

    --
    Robert Kern

    "I have come to believe that the whole world is an enigma, a harmless enigma
    that is made terrible by our own mad attempt to interpret it as though it had
    an underlying truth."
    -- Umberto Eco
  • Terry Reedy at May 14, 2011 at 1:41 am

    On 5/13/2011 3:53 PM, harrismh777 wrote:

    The unicode consortium is very careful to make sure that thousands of
    symbols have a unique code point (that's great !) but how do these
    thousands of symbols actually get displayed if there is no font
    consortium? Are there collections of 'standard' fonts for unicode that I
    am not aware of? Is there a unix linux package that can be installed that
    drops at least 'one' default standard font that will be able to render
    all or 'most' (whatever I mean by that) code points in unicode? Is this
    a Python issue at all?
    Easy, practical use of unicode is still a work in progress.

    --
    Terry Jan Reedy
  • Harrismh777 at May 14, 2011 at 7:41 am

    Terry Reedy wrote:
    Is there a unix linux package that can be installed that
    drops at least 'one' default standard font that will be able to render
    all or 'most' (whatever I mean by that) code points in unicode? Is this
    a Python issue at all?
    Easy, practical use of unicode is still a work in progress.
    Apparently... the good news for me is that SBL provides their unicode
    font here:

    http://www.sbl-site.org/educational/biblicalfonts.aspx

    I'm getting much closer here, but now the problem is typing. The pain
    with unicode fonts is that the glyph is tied to the code point for the
    represented character, and not tied to any code point that matches any
    keyboard scan code for typing. :-}

    So, I can now see the ancient text with accents and apparatus in all of
    my editors, but I still cannot type any ancient Greek with my
    keyboard... because I have to make up a keymap first. <sigh>

    I don't find that SBL (nor Logos Software) has provided keymaps as
    yet... rats.

    I can read the text with Python though... yessss.


    m harris
  • Jmfauth at May 14, 2011 at 10:26 am

    On 14 May, 09:41, harrismh777 wrote:

    ...
    I'm getting much closer here,
    ...
    You should really understand that Unicode is a domain per
    se. It is independent of any OS, programming language
    or application. It is up to these tools to be "unicode"
    compliant.

    Working in a full unicode mode (at least for text) is
    today practically a solved problem. But you have to ensure
    the whole toolchain is unicode compliant (editors,
    fonts (OpenType technology), rendering devices, ...).

    Tip: this list is certainly not the best place to grab
    information. I suggest you start by getting information
    about XeTeX. XeTeX is the "new" TeX engine working only
    in unicode mode. From this starting point, you will
    come across plenty of web sites speaking about the "unicode
    world", tools, fonts, ...

    A variant is to visit sites speaking about *typography*.

    jmf
  • Terry Reedy at May 14, 2011 at 8:26 pm

    On 5/14/2011 3:41 AM, harrismh777 wrote:
    Terry Reedy wrote:
    Easy, practical use of unicode is still a work in progress.
    Apparently... the good news for me is that SBL provides their unicode
    font here:

    http://www.sbl-site.org/educational/biblicalfonts.aspx

    I'm getting much closer here, but now the problem is typing. The pain
    with unicode fonts is that the glyph is tied to the code point for the
    represented character, and not tied to any code point that matches any
    keyboard scan code for typing. :-}

    So, I can now see the ancient text with accents and apparatus in all of
    my editors, but I still cannot type any ancient Greek with my
    keyboard... because I have to make up a keymap first. <sigh>

    I don't find that SBL (nor Logos Software) has provided keymaps as
    yet... rats.
    You need what is called, at least with Windows, an IME -- Input Method
    Editor. These are part of (or associated with) the OS, so they can be
    used with *any* application that will accept unicode chars (in whatever
    encoding) rather than just ascii chars. Windows has about a hundred or
    so, including Greek. I do not know if that includes classical Greek with
    the extra marks.
    I can read the text with Python though... yessss.
    --
    Terry Jan Reedy
  • Ben Finney at May 14, 2011 at 11:47 pm

    Terry Reedy <tjreedy at udel.edu> writes:

    You need what is called, at least with Windows, an IME -- Input Method
    Editor.
    For a GNOME or KDE environment you want an input method framework; I
    recommend IBus <URL:http://code.google.com/p/ibus/> which comes with the
    major GNU+Linux operating systems <URL:http://oswatershed.org/pkg/ibus>
    <URL:http://packages.debian.org/squeeze/ibus> .

    Then you have a wide range of input methods available. Many of them are
    specific to local writing systems. For writing special characters in
    English text, I use either ‘rfc1345’ or ‘latex’ within IBus.

    That allows special characters to be typed into any program which
    communicates with the desktop environment's input routines. Yay, unified
    input of special characters!

    Except Emacs :-( which fortunately has ‘ibus-el’ available to work with
    IBus <URL:http://www.emacswiki.org/emacs/IBusMode> :-).

    --
    \ 己所不欲，勿施於人。 |
    `\ (What is undesirable to you, do not do to others.) |
    _o__) —孔夫子 Confucius, 551 BCE – 479 BCE |
    Ben Finney
  • Nobody at May 14, 2011 at 8:34 am

    On Fri, 13 May 2011 14:53:50 -0500, harrismh777 wrote:

    The unicode consortium is very careful to make sure that thousands
    of symbols have a unique code point (that's great !) but how do these
    thousands of symbols actually get displayed if there is no font
    consortium? Are there collections of 'standard' fonts for unicode that I
    am not aware of? Is there a unix linux package that can be installed that
    drops at least 'one' default standard font that will be able to render all
    or 'most' (whatever I mean by that) code points in unicode?
    Using the original meaning of "font" (US) or "fount" (commonwealth), you
    can't have a single font cover the whole of Unicode. A font isn't a random
    set of glyphs, but a set of glyphs in a common style, which can only
    practically be achieved for a specific alphabet.

    You can bundle multiple fonts covering multiple repertoires into a single
    TTF (etc) file, but there's not much point.

    In software, the term "font" is commonly used to refer to some ad-hoc
    mapping between codepoints and glyphs. This typically works by either
    associating each specific font with a specific repertoire (set of
    codepoints), or by simply trying each font in order until one is found
    with the correct glyph.

    This is a sufficiently common problem that the FontConfig library exists
    to simplify a large part of it.
    Is this a Python issue at all?
    No.
