FAQ
I'm trying to understand what's going on with this simple program


if __name__=='__main__':
  print("repr=%s" % repr(u'\xc1'))
  print("%%r=%r" % u'\xc1')


On my Windows XP box this fails miserably if run directly at a terminal


C:\tmp> \Python33\python.exe bang.py
Traceback (most recent call last):
    File "bang.py", line 2, in <module>
      print("repr=%s" % repr(u'\xc1'))
    File "C:\Python33\lib\encodings\cp437.py", line 19, in encode
      return codecs.charmap_encode(input,self.errors,encoding_map)[0]
UnicodeEncodeError: 'charmap' codec can't encode character '\xc1' in position 6:
character maps to <undefined>


If I run the program redirected into a file then no error occurs and the
result looks like this


C:\tmp>cat fff
repr='?'
%r='?'


and if I run it into a pipe it works as though into a file.


It seems that repr thinks it can render u'\xc1' directly, which is a problem
since print then seems to want to convert that to cp437 when directed at a terminal.


I find the idea that print knows what it's printing to a bit dangerous, but it's
the repr behaviour that strikes me as bad.


What is responsible for defining the repr function's notion of 'printable', so that
repr would give me, say, an ASCII rendering?
-confused-ly yrs-
Robin Becker


  • Ned Batchelder at Nov 15, 2013 at 11:38 am

    On Friday, November 15, 2013 6:28:15 AM UTC-5, Robin Becker wrote:
    I'm trying to understand what's going on with this simple program

    if __name__=='__main__':
    print("repr=%s" % repr(u'\xc1'))
    print("%%r=%r" % u'\xc1')

    On my windows XP box this fails miserably if run directly at a terminal

    C:\tmp> \Python33\python.exe bang.py
    Traceback (most recent call last):
    File "bang.py", line 2, in <module>
    print("repr=%s" % repr(u'\xc1'))
    File "C:\Python33\lib\encodings\cp437.py", line 19, in encode
    return codecs.charmap_encode(input,self.errors,encoding_map)[0]
    UnicodeEncodeError: 'charmap' codec can't encode character '\xc1' in position 6:
    character maps to <undefined>

    If I run the program redirected into a file then no error occurs and the
    result looks like this

    C:\tmp>cat fff
    repr='?'
    %r='?'

    and if I run it into a pipe it works as though into a file.

    It seems that repr thinks it can render u'\xc1' directly, which is a problem
    since print then seems to want to convert that to cp437 when directed at a terminal.

    I find the idea that print knows what it's printing to a bit dangerous, but it's
    the repr behaviour that strikes me as bad.

    What is responsible for defining the repr function's 'printable' so that repr
    would give me say an Ascii rendering?
    -confused-ly yrs-
    Robin Becker

    In Python3, repr() will return a Unicode string, and will preserve existing Unicode characters in its arguments. This has been controversial. To get the Python 2 behavior of a pure-ascii representation, there is the new builtin ascii(), and a corresponding %a format string.


    --Ned.
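The repr()/ascii() split Ned describes is easy to check with a quick sketch (Python 3):

```python
# Python 3: repr() keeps printable non-ASCII characters as-is, while
# ascii() (and the %a conversion) escape them, as repr() did in Python 2.
s = '\xc1'              # LATIN CAPITAL LETTER A WITH ACUTE
print(repr(s))          # 'Á'
print(ascii(s))         # '\xc1'
print("%a" % s)         # '\xc1'
```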
  • Robin Becker at Nov 15, 2013 at 12:16 pm
    On 15/11/2013 11:38, Ned Batchelder wrote:
    ..........
    In Python3, repr() will return a Unicode string, and will preserve existing Unicode characters in its arguments. This has been controversial. To get the Python 2 behavior of a pure-ascii representation, there is the new builtin ascii(), and a corresponding %a format string.

    --Ned.

    thanks for this; it doesn't make the split across python 2 - 3 any easier.
    --
    Robin Becker
  • Ned Batchelder at Nov 15, 2013 at 1:54 pm

    On Friday, November 15, 2013 7:16:52 AM UTC-5, Robin Becker wrote:
    On 15/11/2013 11:38, Ned Batchelder wrote:
    ..........
    In Python3, repr() will return a Unicode string, and will preserve existing Unicode characters in its arguments. This has been controversial. To get the Python 2 behavior of a pure-ascii representation, there is the new builtin ascii(), and a corresponding %a format string.

    --Ned.
    thanks for this; it doesn't make the split across python 2 - 3 any easier.
    --
    Robin Becker

    No, but I've found that significant programs that run on both 2 and 3 need to have some shims to make the code work anyway. You could do this:


         try:
             repr = ascii
         except NameError:
             pass


    and then use repr throughout.


    --Ned.
  • Robin Becker at Nov 15, 2013 at 2:29 pm
    On 15/11/2013 13:54, Ned Batchelder wrote:
    .........
    No, but I've found that significant programs that run on both 2 and 3 need to have some shims to make the code work anyway. You could do this:

    try:
    repr = ascii
    except NameError:
    pass
    ....
    yes I tried that, but it doesn't affect %r, which is inlined in unicodeobject.c;
    for me it seems easier to fix Windows to use something like a standard encoding
    of UTF-8, i.e. cp65001, but that's quite hard to do globally. It seems sitecustomize
    is too late to set os.environ['PYTHONIOENCODING'], so perhaps I can stuff that into
    one of the global environment vars and have it work for all python invocations.
    --
    Robin Becker
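For what it's worth, PYTHONIOENCODING is read at interpreter start-up, which is why sitecustomize is too late; a small demo (an illustrative sketch on a modern Python, not from the thread) showing it taking effect when set in the parent environment:

```python
import os
import subprocess
import sys

# PYTHONIOENCODING is consulted before any Python code runs, so it must
# already be in the environment of the process launching the interpreter.
env = dict(os.environ, PYTHONIOENCODING='utf-8')
result = subprocess.run(
    [sys.executable, '-c', 'import sys; print(sys.stdout.encoding)'],
    env=env, capture_output=True, text=True,
)
print(result.stdout.strip())  # utf-8
```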
  • Serhiy Storchaka at Nov 15, 2013 at 2:40 pm

    On 15.11.13 15:54, Ned Batchelder wrote:
    No, but I've found that significant programs that run on both 2 and 3 need to have some shims to make the code work anyway. You could do this:

    try:
    repr = ascii
    except NameError:
    pass

    and then use repr throughout.

    Or rather


          try:
              ascii
          except NameError:
              ascii = repr


    and then use ascii throughout.
  • Robin Becker at Nov 15, 2013 at 2:52 pm
    On 15/11/2013 14:40, Serhiy Storchaka wrote:
    ......



    and then use repr throughout.
    Or rather

    try:
    ascii
    except NameError:
    ascii = repr

    and then use ascii throughout.

    apparently you can import ascii from future_builtins and the print() function is
    available as


    from __future__ import print_function


    nothing fixes all those %r formats to be %a though :(
    --
    Robin Becker
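Since %r can't be redirected to ascii(), one workaround (an illustrative sketch, not from the thread) is to route values through the shim explicitly and format with %s instead:

```python
# Runs on Python 2 and 3: alias ascii to repr where it doesn't exist,
# then use %s with an explicit ascii() call in place of %r.
try:
    ascii
except NameError:        # Python 2
    ascii = repr

value = u'\xc1'
# escapes the non-ASCII character on both versions
# (Python 2 additionally shows the u prefix)
print("value=%s" % ascii(value))
```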
  • Roy Smith at Nov 15, 2013 at 2:25 pm
    In article <b6db8982-feac-4036-8ec4-2dc720d41a4b@googlegroups.com>,
    Ned Batchelder wrote:

    In Python3, repr() will return a Unicode string, and will preserve existing
    Unicode characters in its arguments. This has been controversial. To get
    the Python 2 behavior of a pure-ascii representation, there is the new
    builtin ascii(), and a corresponding %a format string.

    I'm still stuck on Python 2, and while I can understand the controversy ("It breaks my Python 2 code!"), this seems like the right thing to have done. In Python 2, unicode is an add-on. One of the big design drivers in Python 3 was to make unicode the standard.


    The idea behind repr() is to provide a "just plain text" representation of an object. In P2, "just plain text" means ascii, so escaping non-ascii characters makes sense. In P3, "just plain text" means unicode, so escaping non-ascii characters no longer makes sense.


    Some of us have been doing this long enough to remember when "just plain text" meant only a single case of the alphabet (and a subset of ascii punctuation). On an ASR-33, your C program would print like:


    MAIN() \(
      PRINTF("HELLO, ASCII WORLD");
    \)


    because ASR-33's didn't have curly braces (or lower case).


    Having P3's repr() escape non-ascii characters today makes about as much sense as expecting P2's repr() to escape curly braces (and vertical bars, and a few others) because not every terminal can print those.


    --
    Roy Smith
    roy at panix.com


  • Robin Becker at Nov 15, 2013 at 2:43 pm
    ..........
    I'm still stuck on Python 2, and while I can understand the controversy ("It breaks my Python 2 code!"), this seems like the right thing to have done. In Python 2, unicode is an add-on. One of the big design drivers in Python 3 was to make unicode the standard.

    The idea behind repr() is to provide a "just plain text" representation of an object. In P2, "just plain text" means ascii, so escaping non-ascii characters makes sense. In P3, "just plain text" means unicode, so escaping non-ascii characters no longer makes sense.

    unfortunately the word 'printable' got into the definition of repr; it's clear
    that printability is not the same as unicode at least as far as the print
    function is concerned. In my opinion it would have been better to leave the old
    behaviour as that would have eased the compatibility.


    The python gods don't count that sort of thing as important enough, so we get the
    mess that is the python2/3 split. ReportLab has to do both, so it's a real issue;
    in addition, swapping the str/unicode pair for bytes/str doesn't help one's
    mental models either :(


    Things went wrong when utf8 was not adopted as the standard encoding, thus
    requiring two string types; it would have been easier to have a len function to
    count bytes as before and a glyphlen to count glyphs. Now, as I understand it, we
    have a complicated mess under the hood for unicode objects, so they have a
    variable representation to approximate an 8-bit representation when suitable etc
    etc etc.
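The byte-count/glyph-count distinction Robin raises is easy to see in Python 3 (a minimal sketch; len() counts code points, which still isn't quite "glyphs" once combining characters enter the picture):

```python
# len() counts code points; counting bytes requires encoding first.
s = 'q\u1234zy'
print(len(s))                    # 4 code points
print(len(s.encode('utf-8')))    # 6 bytes: \u1234 takes three bytes in UTF-8
```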

    Some of us have been doing this long enough to remember when "just plain text" meant only a single case of the alphabet (and a subset of ascii punctuation). On an ASR-33, your C program would print like:

    MAIN() \(
    PRINTF("HELLO, ASCII WORLD");
    \)

    because ASR-33's didn't have curly braces (or lower case).

    Having P3's repr() escape non-ascii characters today makes about as much sense as expecting P2's repr() to escape curly braces (and vertical bars, and a few others) because not every terminal can print those.
    .....
    I can certainly remember those days, how we cried and laughed when 8 bits became
    popular.
    --
    Robin Becker
  • Joel Goldstick at Nov 15, 2013 at 2:50 pm

    Some of us have been doing this long enough to remember when "just plain
    text" meant only a single case of the alphabet (and a subset of ascii
    punctuation). On an ASR-33, your C program would print like:

    MAIN() \(
    PRINTF("HELLO, ASCII WORLD");
    \)

    because ASR-33's didn't have curly braces (or lower case).

    Having P3's repr() escape non-ascii characters today makes about as much
    sense as expecting P2's repr() to escape curly braces (and vertical bars,
    and a few others) because not every terminal can print those.
    .....
    I can certainly remember those days, how we cried and laughed when 8 bits
    became popular.
    Really? You cried and laughed over 7 vs. 8 bits? That's lovely (?).
    ;). That eighth bit sure was less confusing than codepoint
    translations
  • Robin Becker at Nov 15, 2013 at 3:03 pm
    ...........
    became popular.
    Really? you cried and laughed over 7 vs. 8 bits? That's lovely (?).
    ;). That eighth bit sure was less confusing than codepoint
    translations



    No, we had 6 bits in 60-bit words as I recall; extracting the nth character
    involved division by 6; smart people did tricks with inverted multiplications
    etc etc :(
    --
    Robin Becker
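The extraction Robin describes (the nth 6-bit character out of a 60-bit word) can be sketched like this (an illustration of the packing scheme, not the original CDC code):

```python
# Ten 6-bit characters packed big-endian into a 60-bit word;
# character 0 occupies the top 6 bits.
def char_at(word, n):
    shift = 54 - 6 * n          # the "division by 6" in disguise
    return (word >> shift) & 0o77

# pack three characters and read them back
word = (0o01 << 54) | (0o02 << 48) | (0o03 << 42)
print([char_at(word, i) for i in range(3)])   # [1, 2, 3]
```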
  • Joel Goldstick at Nov 15, 2013 at 3:07 pm

    On Fri, Nov 15, 2013 at 10:03 AM, Robin Becker wrote:
    ...........
    became popular.
    Really? you cried and laughed over 7 vs. 8 bits? That's lovely (?).
    ;). That eighth bit sure was less confusing than codepoint
    translations


    no we had 6 bits in 60 bit words as I recall; extracting the nth character
    involved division by 6; smart people did tricks with inverted
    multiplications etc etc :(
    --

    Cool, someone here is older than me! I came in with the 8080, and I
    remember split octal, but sixes are something I missed out on.
    Robin Becker





    --
    Joel Goldstick
    http://joelgoldstick.com
  • Robin Becker at Nov 15, 2013 at 3:18 pm
    On 15/11/2013 15:07, Joel Goldstick wrote:
    ........





    Cool, someone here is older than me! I came in with the 8080, and I
    remember split octal, but sixes are something I missed out on.

    The pdp 10/15 had 18 bit words and could be organized as 3*6 or 2*9, pdp 8s had
    12 bits I think, then came the IBM 7094 which had 36 bits and finally the
    CDC6000 & 7600 machines with 60 bits; someone must have liked 6's
    -mumbling-ly yrs-
    Robin Becker
  • Roy Smith at Nov 15, 2013 at 3:32 pm

    On Nov 15, 2013, at 10:18 AM, Robin Becker wrote:


    The pdp 10/15 had 18 bit words and could be organized as 3*6 or 2*9

    I don't know about the 15, but the 10 had 36 bit words (18-bit halfwords). One common character packing was 5 7-bit characters per 36 bit word (with the sign bit left over).


    Anybody remember RAD-50? It let you represent a 6-character filename (plus a 3-character extension) in a 16 bit word. RT-11 used it, not sure if it showed up anywhere else.


    ---
    Roy Smith
    roy at panix.com


  • Zero Piraeus at Nov 15, 2013 at 5:06 pm
    :

    On Fri, Nov 15, 2013 at 10:32:54AM -0500, Roy Smith wrote:
    Anybody remember RAD-50? It let you represent a 6-character filename
    (plus a 3-character extension) in a 16 bit word. RT-11 used it, not
    sure if it showed up anywhere else.

    Presumably 16 is a typo, but I just had a moderate amount of fun
    envisaging how that might work: if the characters were restricted to
    vowels, then 5**6 < 2**14, giving a couple of bits left over for a
    choice of four preset "three-character" extensions.


    I can't say that AEIOUA.EX1 looks particularly appealing, though ...
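Zero's arithmetic checks out: 5**6 = 15625 < 16384 = 2**14. A sketch of the vowels-only packing (purely for fun; the extension names are hypothetical):

```python
VOWELS = 'AEIOU'
EXTS = ('EX1', 'EX2', 'EX3', 'EX4')   # the four preset "extensions"

def pack(name, ext_index):
    # six base-5 digits in the high 14 bits, extension index in the low 2
    n = 0
    for ch in name:
        n = n * 5 + VOWELS.index(ch)
    return (n << 2) | ext_index

def unpack(word):
    ext_index, n = word & 0b11, word >> 2
    chars = []
    for _ in range(6):
        chars.append(VOWELS[n % 5])
        n //= 5
    return ''.join(reversed(chars)), ext_index

w = pack('AEIOUA', 0)
assert w < 2**16                      # fits in one 16-bit word
name, ext = unpack(w)
print('%s.%s' % (name, EXTS[ext]))    # AEIOUA.EX1
```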


      -[]z.


    --
    Zero Piraeus: pollice verso
    http://etiol.net/pubkey.asc
  • Chris Angelico at Nov 15, 2013 at 5:11 pm

    On Sat, Nov 16, 2013 at 4:06 AM, Zero Piraeus wrote:
    :
    On Fri, Nov 15, 2013 at 10:32:54AM -0500, Roy Smith wrote:
    Anybody remember RAD-50? It let you represent a 6-character filename
    (plus a 3-character extension) in a 16 bit word. RT-11 used it, not
    sure if it showed up anywhere else.
    Presumably 16 is a typo, but I just had a moderate amount of fun
    envisaging how that might work: if the characters were restricted to
    vowels, then 5**6 < 2**14, giving a couple of bits left over for a
    choice of four preset "three-character" extensions.

    I can't say that AEIOUA.EX1 looks particularly appealing, though ...

    Looks like it might be this scheme:


    https://en.wikipedia.org/wiki/DEC_Radix-50


    36-bit word for a 6-char filename, but there was also a 16-bit
    variant. I do like that filename scheme you describe, though it would
    tend to produce names that would suit virulent diseases.


    ChrisA
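The 16-bit variant on that page uses a 40-character set (space, A-Z, $, ., an unused slot, 0-9), packing three characters per word since 40**3 = 64000 < 65536; a 6.3 filename therefore needs three words. A sketch, assuming that character order is right:

```python
# DEC Radix-50, 16-bit flavour: three characters per word.
CHARSET = ' ABCDEFGHIJKLMNOPQRSTUVWXYZ$.%0123456789'  # '%' marks the unused code

def rad50_pack(triplet):
    word = 0
    for ch in triplet.upper().ljust(3):
        word = word * 40 + CHARSET.index(ch)
    return word

def rad50_unpack(word):
    chars = []
    for _ in range(3):
        chars.append(CHARSET[word % 40])
        word //= 40
    return ''.join(reversed(chars))

print(rad50_pack('ABC'))                  # well inside 16 bits
print(rad50_unpack(rad50_pack('ABC')))    # ABC
```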
  • Serhiy Storchaka at Nov 15, 2013 at 5:37 pm

    On 15.11.13 17:32, Roy Smith wrote:
    Anybody remember RAD-50? It let you represent a 6-character filename
    (plus a 3-character extension) in a 16 bit word. RT-11 used it, not
    sure if it showed up anywhere else.

    In three 16-bit words.
  • William Ray Wing at Nov 15, 2013 at 4:30 pm

    On Nov 15, 2013, at 10:18 AM, Robin Becker wrote:


    On 15/11/2013 15:07, Joel Goldstick wrote:
    ........


    Cool, someone here is older than me! I came in with the 8080, and I
    remember split octal, but sixes are something I missed out on.
    The pdp 10/15 had 18 bit words and could be organized as 3*6 or 2*9, pdp 8s had 12 bits I think, then came the IBM 7094 which had 36 bits and finally the CDC6000 & 7600 machines with 60 bits, some one must have liked 6's
    -mumbling-ly yrs-
    Robin Becker
    --
    https://mail.python.org/mailman/listinfo/python-list

    Yes, the PDP-8s, LINC-8s, and PDP-12s were all 12-bit computers. However the LINC-8 operated with word-pairs (instruction in one location followed by address to be operated on in the next) so it was effectively a 24-bit computer and the PDP-12 was able to execute BOTH PDP-8 and LINC-8 instructions (it added one extra instruction to each set that flipped the mode).


    First assembly language program I ever wrote was on a PDP-12. (If there is an emoticon for a face with a gray beard, I don't know it.)


    -Bill
  • Gene Heskett at Nov 15, 2013 at 4:36 pm

    On Friday 15 November 2013 11:28:19 Joel Goldstick did opine:

    On Fri, Nov 15, 2013 at 10:03 AM, Robin Becker wrote:
    ...........
    became popular.
    Really? you cried and laughed over 7 vs. 8 bits? That's lovely (?).
    ;). That eighth bit sure was less confusing than codepoint
    translations
    no we had 6 bits in 60 bit words as I recall; extracting the nth
    character involved division by 6; smart people did tricks with
    inverted multiplications etc etc :(
    --
    Cool, someone here is older than me! I came in with the 8080, and I
    remember split octal, but sixes are something I missed out on.

    Ok, if you are feeling old & decrepit, how's this for a birthday: 10/04/34,
    I came into micro computers about RCA 1802 time. Wrote a program for the
    1802 without an assembler, for tape editing in '78 at KRCR-TV in Redding
    CA, that was still in use in '94, but never really wrote assembly code
    until the 6809 was out in the Radio Shack Color Computers. os9 on the
    coco's was the best teacher about the unix way of doing things there ever
    was. So I tell folks these days that I am 39, with 40 years experience at
    being 39. ;-)

    Robin Becker



    Cheers, Gene
    --
    "There are four boxes to be used in defense of liberty:
      soap, ballot, jury, and ammo. Please use in that order."
    -Ed Howdershelt (Author)


    Counting in binary is just like counting in decimal -- if you are all
    thumbs.
       -- Glaser and Way
    A pen in the hand of this president is far more
    dangerous than 200 million guns in the hands of
              law-abiding citizens.
  • Mark Lawrence at Nov 15, 2013 at 5:58 pm

    On 15/11/2013 16:36, Gene Heskett wrote:
    On Friday 15 November 2013 11:28:19 Joel Goldstick did opine:
    On Fri, Nov 15, 2013 at 10:03 AM, Robin Becker wrote:
    ...........
    became popular.
    Really? you cried and laughed over 7 vs. 8 bits? That's lovely (?).
    ;). That eighth bit sure was less confusing than codepoint
    translations
    no we had 6 bits in 60 bit words as I recall; extracting the nth
    character involved division by 6; smart people did tricks with
    inverted multiplications etc etc :(
    --
    Cool, someone here is older than me! I came in with the 8080, and I
    remember split octal, but sixes are something I missed out on.
    Ok, if you are feeling old & decrepit, hows this for a birthday: 10/04/34,
    I came into micro computers about RCA 1802 time. Wrote a program for the
    1802 without an assembler, for tape editing in '78 at KRCR-TV in Redding
    CA, that was still in use in '94, but never really wrote assembly code
    until the 6809 was out in the Radio Shack Color Computers. os9 on the
    coco's was the best teacher about the unix way of doing things there ever
    was. So I tell folks these days that I am 39, with 40 years experience at
    being 39. ;-)
    Robin Becker

    Cheers, Gene

    I also used the RCA 1802, but did you use the Ferranti F100L? Rationale
    for the use of both, mid/late 70s they were the only processors of their
    respective type with military approvals.


    Can't remember how we coded on the F100L, but the 1802 work was done on
    the Texas Instruments Silent 700, copying from one cassette tape to
    another. Set the controls wrong when copying and whoops, you've just
    overwritten the work you've just done. We could have had a decent
    development environment but it was on a UK MOD cost plus project, so the
    more inefficiently you worked, the more profit your employer made.


    --
    Python is the second best programming language in the world.
    But the best has yet to be invented. Christian Tismer


    Mark Lawrence
  • Gene Heskett at Nov 15, 2013 at 7:23 pm

    On Friday 15 November 2013 13:52:40 Mark Lawrence did opine:

    On 15/11/2013 16:36, Gene Heskett wrote:
    On Friday 15 November 2013 11:28:19 Joel Goldstick did opine:
    On Fri, Nov 15, 2013 at 10:03 AM, Robin Becker <robin@reportlab.com>
    wrote:
    ...........
    became popular.
    Really? you cried and laughed over 7 vs. 8 bits? That's lovely
    (?). ;). That eighth bit sure was less confusing than codepoint
    translations
    no we had 6 bits in 60 bit words as I recall; extracting the nth
    character involved division by 6; smart people did tricks with
    inverted multiplications etc etc :(
    --
    Cool, someone here is older than me! I came in with the 8080, and I
    remember split octal, but sixes are something I missed out on.
    Ok, if you are feeling old & decrepit, hows this for a birthday:
    10/04/34, I came into micro computers about RCA 1802 time. Wrote a
    program for the 1802 without an assembler, for tape editing in '78 at
    KRCR-TV in Redding CA, that was still in use in '94, but never really
    wrote assembly code until the 6809 was out in the Radio Shack Color
    Computers. os9 on the coco's was the best teacher about the unix way
    of doing things there ever was. So I tell folks these days that I am
    39, with 40 years experience at being 39. ;-)
    Robin Becker
    Cheers, Gene
    I also used the RCA 1802, but did you use the Ferranti F100L? Rationale
    for the use of both, mid/late 70s they were the only processors of their
    respective type with military approvals.

    Can't remember how we coded on the F100L, but the 1802 work was done on
    the Texas Instruments Silent 700, copying from one cassette tape to
    another. Set the controls wrong when copying and whoops, you've just
    overwritten the work you've just done. We could have had a decent
    development environment but it was on a UK MOD cost plus project, so the
    more inefficiently you worked, the more profit your employer made.

    BTDT but in 1959-60 era. Testing the ullage pressure regulators for the
    early birds, including some that gave John Glenn his first ride or 2. I
    don't recall the brand of paper tape recorders, but they used 12at7's &
    12au7's by the grocery sack full. One or more got noisy & me being the
    budding C.E.T. that I now am, of course ran down the bad ones and requested
    new ones. But you had to turn in the old ones, which Stellardyne Labs
    simply recycled back to you the next time you needed a few. Hopeless
    management IMO, but thats cost plus for you.


    At 10k$ a truckload for helium back then, each test lost about $3k worth of
    helium because the recycle catcher tank was so thin walled. And the 6
    stage cardox re-compressor was so leaky, occasionally blowing up a pipe out
    of the last stage that put about 7800 lbs back in the monel tanks.


    I considered that a huge waste compared to the cost of a 12au7, then about
    $1.35, and raised hell, so I got fired. They simply did not care that a
    perfectly good regulator was being abused to death when it took 10 or more
    test runs to get one good recording for the certification. At those
    operating pressures, the valve faces erode just like the seats in your
    shower faucets do in 20 years. Ten such runs and you may as well bin it,
    but they didn't.


    I am amazed that as many of those birds worked as did. Of course if it
    wasn't manned, they didn't talk about the roman candles on the launch pads.
    I heard one story that they had to regrade one pads real estate at
    Vandenburg & start all over, seems some ID10T had left the cable to the
    explosive bolts hanging on the cable tower. Ooops, and theres no off
    switch in many of those once the umbilical has been dropped.


    Cheers, Gene
    --
    "There are four boxes to be used in defense of liberty:
      soap, ballot, jury, and ammo. Please use in that order."
    -Ed Howdershelt (Author)


    Tehee quod she, and clapte the wyndow to.
       -- Geoffrey Chaucer
    A pen in the hand of this president is far more
    dangerous than 200 million guns in the hands of
              law-abiding citizens.
  • Chris Angelico at Nov 15, 2013 at 3:08 pm

    On Sat, Nov 16, 2013 at 1:43 AM, Robin Becker wrote:
    ..........
    I'm still stuck on Python 2, and while I can understand the controversy
    ("It breaks my Python 2 code!"), this seems like the right thing to have
    done. In Python 2, unicode is an add-on. One of the big design drivers in
    Python 3 was to make unicode the standard.

    The idea behind repr() is to provide a "just plain text" representation of
    an object. In P2, "just plain text" means ascii, so escaping non-ascii
    characters makes sense. In P3, "just plain text" means unicode, so escaping
    non-ascii characters no longer makes sense.
    unfortunately the word 'printable' got into the definition of repr; it's
    clear that printability is not the same as unicode at least as far as the
    print function is concerned. In my opinion it would have been better to
    leave the old behaviour as that would have eased the compatibility.

    "Printable" means many different things in different contexts. In some
    contexts, the sequence \x66\x75\x63\x6b is considered unprintable, yet
    each of those characters is perfectly displayable in its natural form.
    Under IDLE, non-BMP characters can't be displayed (or at least, that's
    how it has been; I haven't checked current status on that one). On
    Windows, the console runs in codepage 437 by default (again, I may be
    wrong here), so anything not representable in that has to be escaped.
    My Linux box has its console set to full Unicode, everything working
    perfectly, so any non-control character can be printed. As far as
    Python's concerned, all of that is outside - something is "printable"
    if it's printable within Unicode, and the other hassles are matters of
    encoding. (Except the first one. I don't think there's an encoding
    "g-rated".)

    The python gods don't count that sort of thing as important enough so we get
    the mess that is the python2/3 split. ReportLab has to do both so it's a
    real issue; in addition swapping the str - unicode pair to bytes str doesn't
    help one's mental models either :(

    That's fixing, in effect, a long-standing bug - of a sort. The name
    "str" needs to be applied to the most normal string type. As of Python
    3, that's a Unicode string, which is as it should be. In Python 2, it
    was the ASCII/bytes string, which still fit the description of "most
    normal string type", but that means that Python 2 programs are
    Unicode-unaware by default, which is a flaw. Hence the Py3 fix.

    Things went wrong when utf8 was not adopted as the standard encoding thus
    requiring two string types, it would have been easier to have a len function
    to count bytes as before and a glyphlen to count glyphs. Now as I understand
    it we have a complicated mess under the hood for unicode objects so they
    have a variable representation to approximate an 8 bit representation when
    suitable etc etc etc.

    http://unspecified.wordpress.com/2012/04/19/the-importance-of-language-level-abstract-unicode-strings/


    There are languages that do what you describe. It's very VERY easy to
    break stuff. What happens when you slice a string?

    >>> foo = "asdf"
    >>> foo[:2], foo[2:]
    ('as', 'df')
    >>> foo = "q\u1234zy"
    >>> foo[:2], foo[2:]
    ('qሴ', 'zy')


    Looks good to me. I split a four-character string, I get two
    one-character strings. If that had been done in UTF-8, either I would
    need to know "don't split at that boundary, that's between bytes in a
    character", or else the indexing and slicing would have to be done by
    counting characters from the beginning of the string - an O(n)
    operation, rather than an O(1) pointer arithmetic, not to mention that
    it'll blow your CPU cache (touching every part of a potentially-long
    string) just to find the position.
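The boundary problem Chris describes shows up as soon as you slice the UTF-8 bytes instead of the string (a minimal sketch):

```python
foo = "q\u1234zy"
raw = foo.encode('utf-8')   # 6 bytes: 'q' + three bytes for \u1234 + 'z' + 'y'

# Slicing the bytes at index 2 lands in the middle of a character:
chunk = raw[:2]             # b'q\xe1'
try:
    chunk.decode('utf-8')
except UnicodeDecodeError as e:
    print('broken slice:', e.reason)
```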


    The only reliable way to manage things is to work with true Unicode.
    You can completely ignore the internal CPython representation; what
    matters is that in Python (any implementation, as long as it conforms
    with version 3.3 or later) lets you index Unicode codepoints out of a
    Unicode string, without differentiating between those that happen to
    be ASCII, those that fit in a single byte, those that fit in two
    bytes, and those that are flagged RTL, because none of those
    considerations makes any difference to you.


    It takes some getting your head around, but it's worth it - same as
    using git instead of a Windows shared drive. (I'm still trying to push
    my family to think git.)


    ChrisA
  • Ned Batchelder at Nov 15, 2013 at 3:08 pm

    On Friday, November 15, 2013 9:43:17 AM UTC-5, Robin Becker wrote:
    Things went wrong when utf8 was not adopted as the standard encoding thus
    requiring two string types, it would have been easier to have a len function to
    count bytes as before and a glyphlen to count glyphs. Now as I understand it we
    have a complicated mess under the hood for unicode objects so they have a
    variable representation to approximate an 8 bit representation when suitable etc
    etc etc.

    Dealing with bytes and Unicode is complicated, and the 2->3 transition is not easy, but let's please not spread the misunderstanding that somehow the Flexible String Representation is at fault. However you store Unicode code points, they are different than bytes, and it is complex having to deal with both. You can't somehow make the dichotomy go away, you can only choose where you want to think about it.


    --Ned.

    --
    Robin Becker
  • Robin Becker at Nov 15, 2013 at 3:39 pm
    .........
    Dealing with bytes and Unicode is complicated, and the 2->3 transition is not easy, but let's please not spread the misunderstanding that somehow the Flexible String Representation is at fault. However you store Unicode code points, they are different than bytes, and it is complex having to deal with both. You can't somehow make the dichotomy go away, you can only choose where you want to think about it.

    --Ned.
    .......
    I don't think that's what I said; the flexible representation is just an added
    complexity that has come about because of the wish to store strings in a compact
    way. The requirement for such complexity is the unicode type itself (especially
    the storage requirements) which necessitated some remedial action.


    There's no point in fighting the change to using unicode. The type wasn't
    required for any technical reason as other languages didn't go this route and
    are reasonably ok, but there's no doubt the change made things more difficult.
    --
    Robin Becker
  • Antoon Pardon at Nov 15, 2013 at 3:49 pm

    On 15-11-13 16:39, Robin Becker wrote:
    .........
    Dealing with bytes and Unicode is complicated, and the 2->3 transition
    is not easy, but let's please not spread the misunderstanding that
    somehow the Flexible String Representation is at fault. However you
    store Unicode code points, they are different than bytes, and it is
    complex having to deal with both. You can't somehow make the
    dichotomy go away, you can only choose where you want to think about it.

    --Ned.
    .......
    I don't think that's what I said; the flexible representation is just an
    added complexity ...

    No it is not, at least not for python programmers. (It of course is for
    the python implementors). The python programmer doesn't have to care
    about the flexible representation, just as the python programmer doesn't
    have to care about the internal representation of (long) integers. It
    is an implementation detail that is mostly ignorable.
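    A quick way to see that the flexible representation is invisible at the
    language level (a sketch; any Python 3.3+ will do) — string semantics are
    identical no matter how wide the widest code point is:

```python
# The flexible (PEP 393) representation never changes string semantics:
# indexing, length and equality behave identically whatever the widest
# code point in the string happens to be.
a = "abc"            # stored 1 byte per code point internally
b = "ab\u20ac"       # euro sign forces 2 bytes per code point
c = "ab\U0001F600"   # astral emoji forces 4 bytes per code point

for s in (a, b, c):
    assert len(s) == 3          # always counted in code points
    assert s[2] == s[-1]        # O(1) indexing works the same way

print(len(a), len(b), len(c))
```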


    --
    Antoon Pardon
  • Chris Angelico at Nov 15, 2013 at 4:01 pm

    On Sat, Nov 16, 2013 at 2:39 AM, Robin Becker wrote:
    Dealing with bytes and Unicode is complicated, and the 2->3 transition is
    not easy, but let's please not spread the misunderstanding that somehow the
    Flexible String Representation is at fault. However you store Unicode code
    points, they are different than bytes, and it is complex having to deal with
    both. You can't somehow make the dichotomy go away, you can only choose
    where you want to think about it.

    --Ned.
    .......
    I don't think that's what I said; the flexible representation is just an
    added complexity that has come about because of the wish to store strings in
    a compact way. The requirement for such complexity is the unicode type
    itself (especially the storage requirements) which necessitated some
    remedial action.

    There's no point in fighting the change to using unicode. The type wasn't
    required for any technical reason as other languages didn't go this route
    and are reasonably ok, but there's no doubt the change made things more
    difficult.

    There's no perceptible difference between a 3.2 wide build and the 3.3
    flexible representation. (Differences with narrow builds are bugs, and
    have now been fixed.) As far as your script's concerned, Python 3.3
    always stores strings in UTF-32, four bytes per character. It just
    happens to be way more efficient on memory, most of the time.
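    sys.getsizeof makes the memory saving Chris describes visible; the exact
    numbers vary by build and version, so treat them as illustrative only:

```python
import sys

ascii_s  = "x" * 100                 # 1 byte per code point under PEP 393
latin_s  = "\xc1" * 100              # still 1 byte (Latin-1 range)
bmp_s    = "\u20ac" * 100            # 2 bytes per code point
astral_s = "\U0001F600" * 100        # 4 bytes per code point

for s in (ascii_s, latin_s, bmp_s, astral_s):
    print(len(s), sys.getsizeof(s))  # same length, very different sizes
```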


    Other languages _have_ gone for at least some sort of Unicode support.
    Unfortunately quite a few have done a half-way job and use UTF-16 as
    their internal representation. That means there's no difference
    between U+0012, U+0123, and U+1234, but U+12345 suddenly gets handled
    differently. ECMAScript actually specifies the perverse behaviour of
    treating codepoints >U+FFFF as two elements in a string, because it's
    just too costly to change.
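    The difference Chris describes can be checked directly (a sketch): in
    CPython 3.3+ an astral character is one element of the string, while
    encoding to UTF-16 exposes the two code units a UTF-16-based language
    such as ECMAScript (or a pre-3.3 narrow build) would report instead:

```python
s = "\U00012345"
assert len(s) == 1                      # one code point, one element
units = s.encode("utf-16-le")           # surrogate pair: D808 DF45
assert len(units) // 2 == 2             # but two UTF-16 code units
print(len(s), len(units) // 2)
```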


    There are a small number of languages that guarantee correct Unicode
    handling. I believe bash scripts get this right (though I haven't
    tested; string manipulation in bash isn't nearly as rich as a proper
    text parsing language, so I don't dig into it much); Pike is a very
    Python-like language, and PEP 393 made Python even more Pike-like,
    because Pike's string has been variable width for as long as I've
    known it. A handful of other languages also guarantee UTF-32
    semantics. All of them are really easy to work with; instead of
    writing your code and then going "Oh, I wonder what'll happen if I
    give this thing weird characters?", you just write your code, safe in
    the knowledge that there is no such thing as a "weird character"
    (except for a few in the ASCII set... you may find that code breaks if
    given a newline in the middle of something, or maybe the slash
    confuses you).


    Definitely don't fight the change to Unicode, because it's not a
    change at all... it's just fixing what was buggy. You already had a
    difference between bytes and characters, you just thought you could
    ignore it.


    ChrisA
  • Neil Cerutti at Nov 15, 2013 at 5:47 pm

    On 2013-11-15, Chris Angelico wrote:
    Other languages _have_ gone for at least some sort of Unicode
    support. Unfortunately quite a few have done a half-way job and
    use UTF-16 as their internal representation. That means there's
    no difference between U+0012, U+0123, and U+1234, but U+12345
    suddenly gets handled differently. ECMAScript actually
    specifies the perverse behaviour of treating codepoints >U+FFFF
    as two elements in a string, because it's just too costly to
    change.

    The unicode support I'm learning in Go is, "Everything is utf-8,
    right? RIGHT?!?" It also has the interesting behavior that
    indexing strings retrieves bytes, while iterating over them
    results in a sequence of runes.


    It comes with support for no encodings save utf-8 (natively) and
    utf-16 (if you work at it). Is that really enough?
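    For comparison, the distinction Neil describes in Go (indexing gives
    bytes, iterating gives runes) corresponds in Python to the explicit
    str/bytes split:

```python
s = "na\u00efve"                  # "naïve"
b = s.encode("utf-8")

# Indexing the *encoded* form gives individual bytes, like Go's s[i] ...
assert b[2] == 0xC3               # first byte of the two-byte 'ï'

# ... while iterating the text gives whole characters, like Go's range.
assert list(s) == ["n", "a", "\u00ef", "v", "e"]
assert len(s) == 5 and len(b) == 6
```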


    --
    Neil Cerutti
  • Steven D'Aprano at Nov 16, 2013 at 1:09 am

    On Fri, 15 Nov 2013 17:47:01 +0000, Neil Cerutti wrote:


    The unicode support I'm learning in Go is, "Everything is utf-8, right?
    RIGHT?!?" It also has the interesting behavior that indexing strings
    retrieves bytes, while iterating over them results in a sequence of
    runes.

    It comes with support for no encodings save utf-8 (natively) and utf-16
    (if you work at it). Is that really enough?

    Only if you never need to handle data created by other applications.






    --
    Steven
  • Steven D'Aprano at Nov 15, 2013 at 5:10 pm

    On Fri, 15 Nov 2013 14:43:17 +0000, Robin Becker wrote:


    Things went wrong when utf8 was not adopted as the standard encoding
    thus requiring two string types, it would have been easier to have a len
    function to count bytes as before and a glyphlen to count glyphs. Now as
    I understand it we have a complicated mess under the hood for unicode
    objects so they have a variable representation to approximate an 8 bit
    representation when suitable etc etc etc.

    No no no! Glyphs are *pictures*, you know the little blocks of pixels
    that you see on your monitor or printed on a page. Before you can count
    glyphs in a string, you need to know which typeface ("font") is being
    used, since fonts generally lack glyphs for some code points.


    [Aside: there's another complication. Some fonts define alternate glyphs
    for the same code point, so that the design of (say) the letter "a" may
    vary within the one string according to whatever typographical rules the
    font supports and the application calls. So the question is, when you
    "count glyphs", should you count "a" and "alternate a" as a single glyph
    or two?]


    You don't actually mean count glyphs, you mean counting code points
    (think characters, only with some complications that aren't important for
    the purposes of this discussion).


    UTF-8 is utterly unsuited for in-memory storage of text strings, I don't
    care how many languages (Go, Haskell?) make that mistake. When you're
    dealing with text strings, the fundamental unit is the character, not the
    byte. Why do you care how many bytes a text string has? If you really
    need to know how much memory an object is using, that's where you use
    sys.getsizeof(), not len().


    We don't say len({42: None}) to discover that the dict requires 136
    bytes, why would you use len("heåvy") to learn that it uses 23 bytes?


    UTF-8 is variable width encoding, which means it's *rubbish* for the in-
    memory representation of strings. Counting characters is slow. Slicing is
    slow. If you have mutable strings, deleting or inserting characters is
    slow. Every operation has to effectively start at the beginning of the
    string and count forward, lest it split bytes in the middle of a UTF
    unit. Or worse, the language doesn't give you any protection from this at
    all, so rather than slow string routines you have unsafe string routines,
    and it's your responsibility to detect UTF boundaries yourself.


    In case you aren't familiar with what I'm talking about, here's an
    example using Python 3.2, starting with a Unicode string and treating it
    as UTF-8 bytes:


    py> u = "heåvy"
    py> s = u.encode('utf-8')
    py> for c in s:
    ... print(chr(c))
    ...
    h
    e
    Ã
    ¥
    v
    y




    "Ã¥"? It didn't take long to get moji-bake in our output, and all I did
    was print the (byte) string one "character" at a time. It gets worse: we
    can easily end up with invalid UTF-8:


    py> a, b = s[:len(s)//2], s[len(s)//2:] # split the string in half
    py> a.decode('utf-8')
    Traceback (most recent call last):
       File "<stdin>", line 1, in <module>
    UnicodeDecodeError: 'utf8' codec can't decode byte 0xc3 in position 2:
    unexpected end of data
    py> b.decode('utf-8')
    Traceback (most recent call last):
       File "<stdin>", line 1, in <module>
    UnicodeDecodeError: 'utf8' codec can't decode byte 0xa5 in position 0:
    invalid start byte




    No, UTF-8 is okay for writing to files, but it's not suitable for text
    strings. The in-memory representation of text strings should be constant
    width, based on characters not bytes, and should prevent the caller from
    accidentally ending up with moji-bake or invalid strings.




    --
    Steven
  • Chris Angelico at Nov 15, 2013 at 5:29 pm

    On Sat, Nov 16, 2013 at 4:10 AM, Steven D'Aprano wrote:
    No, UTF-8 is okay for writing to files, but it's not suitable for text
    strings.

    Correction: It's _great_ for writing to files (and other fundamentally
    byte-oriented streams, like network connections). Does a superb job as
    the default encoding for all sorts of situations. But, as you say, it
    sucks if you want to find the Nth character.
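    A small illustration of why the Nth character is awkward in UTF-8 — byte
    offsets and character indices diverge as soon as a multi-byte character
    appears:

```python
s = "he\u00e5vy"                  # "heåvy"
b = s.encode("utf-8")             # b'he\xc3\xa5vy'

assert s[3] == "v"                # 4th character of the text
assert chr(b[3]) != "v"           # byte 3 is the second byte of 'å' (0xA5)

# To find the Nth character in UTF-8 you must decode (i.e. scan) first:
assert b.decode("utf-8")[3] == "v"
```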


    ChrisA
  • Cousin Stanley at Nov 15, 2013 at 5:45 pm

    ....
    We don't say len({42: None}) to discover
    that the dict requires 136 bytes,
    why would you use len("heåvy")
    to learn that it uses 23 bytes ?
    ....

    #!/usr/bin/env python
    # -*- coding: utf-8 -*-


    """
         illustrate the difference in length of python objects
         and the size of their system storage
    """


    import sys


    s = "heåvy"


    d = { 42 : None }


    print
    print ' s : %s' % s
    print ' len( s ) : %d' % len( s )
    print ' sys.getsizeof( s ) : %s ' % sys.getsizeof( s )
    print
    print
    print ' d : ' , d
    print ' len( d ) : %d' % len( d )
    print ' sys.getsizeof( d ) : %d ' % sys.getsizeof( d )




    --
    Stanley C. Kitching
    Human Being
    Phoenix, Arizona
  • Random832 at Nov 15, 2013 at 6:16 pm
    Of course, the real solution to this issue is to replace sys.stdout on
    windows with an object that can handle Unicode directly with the
    WriteConsoleW function - the problem there is that it will break code
    that expects to be able to use sys.stdout.buffer for binary I/O. I also
    wasn't able to get the analogous stdin replacement class to work with
    input() in my attempts.
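    A minimal sketch of the replacement being described, assuming ctypes and
    the documented WriteConsoleW/GetStdHandle signatures; on non-Windows
    platforms (or when WriteConsoleW fails because stdout is redirected) it
    simply falls back to the ordinary stream:

```python
import sys

def write_console(text):
    """Write *text* via WriteConsoleW on a real Windows console,
    falling back to sys.stdout elsewhere.  Returns the number of
    UTF-16 code units (Windows) or characters (fallback) written."""
    if sys.platform == "win32":
        import ctypes
        kernel32 = ctypes.windll.kernel32
        handle = kernel32.GetStdHandle(-11)          # STD_OUTPUT_HANDLE
        written = ctypes.c_ulong(0)
        # Note: WriteConsoleW counts UTF-16 code units, so len(text) is
        # only correct for BMP-only strings; astral characters need the
        # UTF-16 length instead.
        if kernel32.WriteConsoleW(handle, text, len(text),
                                  ctypes.byref(written), None):
            return written.value
        # Redirected to a file/pipe: WriteConsoleW fails, fall through.
    sys.stdout.write(text)
    return len(text)
```

    As Random832 notes, swapping this in for sys.stdout still breaks code
    that expects sys.stdout.buffer for binary I/O.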
  • Robin Becker at Nov 18, 2013 at 11:47 am

    On 15/11/2013 18:16, random832 at fastmail.us wrote:
    Of course, the real solution to this issue is to replace sys.stdout on
    windows with an object that can handle Unicode directly with the
    WriteConsoleW function - the problem there is that it will break code
    that expects to be able to use sys.stdout.buffer for binary I/O. I also
    wasn't able to get the analogous stdin replacement class to work with
    input() in my attempts.
    I started to use this on my windows installation




    #c:\python33\lib\site-packages\sitecustomize.py
    import sys, codecs
    sys.stdout = codecs.getwriter("utf-8")(sys.stdout.detach())
    sys.stderr = codecs.getwriter("utf-8")(sys.stderr.detach())


    which makes them writable with any unicode; after many years I am quite used to
    garbage appearing in the windows console.


    Unfortunately the above doesn't go into virtual environments, but I assume a
    hacked site.py could do that.
    --
    Robin Becker
  • Robin Becker at Nov 18, 2013 at 12:33 pm
    On 18/11/2013 11:47, Robin Becker wrote:
    ...........
    #c:\python33\lib\site-packages\sitecustomize.py
    import sys, codecs
    sys.stdout = codecs.getwriter("utf-8")(sys.stdout.detach())
    sys.stderr = codecs.getwriter("utf-8")(sys.stderr.detach())
    ........
    it seems that the above needs extra stuff to make some distutils logging work
    etc etc; so now I'm using sitecustomize.py containing


    import sys, codecs
    sys.stdout = codecs.getwriter("utf-8")(sys.stdout.detach())
    sys.stdout.encoding = 'utf8'
    sys.stderr = codecs.getwriter("utf-8")(sys.stderr.detach())
    sys.stderr.encoding = 'utf8'


    --
    Robin Becker
  • Nick Coghlan at Nov 18, 2013 at 12:59 pm

    On 18 Nov 2013 22:36, "Robin Becker" wrote:
    On 18/11/2013 11:47, Robin Becker wrote:
    ...........
    #c:\python33\lib\site-packages\sitecustomize.py
    import sys, codecs
    sys.stdout = codecs.getwriter("utf-8")(sys.stdout.detach())
    sys.stderr = codecs.getwriter("utf-8")(sys.stderr.detach())
    ........
    it seems that the above needs extra stuff to make some distutils logging
    work etc etc; so now I'm using sitecustomize.py containing
    import sys, codecs
    sys.stdout = codecs.getwriter("utf-8")(sys.stdout.detach())
    sys.stdout.encoding = 'utf8'
    sys.stderr = codecs.getwriter("utf-8")(sys.stderr.detach())
    sys.stderr.encoding = 'utf8'

    Note that calling detach() on the standard streams isn't officially
    supported, since it breaks the shadow streams saved in sys.__stderr__, etc.


    Cheers,
    Nick.

    --
    Robin Becker

    _______________________________________________
    Python-ideas mailing list
    Python-ideas at python.org
    https://mail.python.org/mailman/listinfo/python-ideas
  • Victor Stinner at Nov 18, 2013 at 3:25 pm
    Why do you need to force the UTF-8 encoding? Your locale is not
    correctly configured?


    It's better to set PYTHONIOENCODING rather than replacing
    sys.stdout/stderr at runtime.


    There is an open issue to add a TextIOWrapper.set_encoding() method:
    http://bugs.python.org/issue15216
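    For example (shell shown; cmd.exe on Windows uses `set` instead):

```shell
# Make the standard streams UTF-8 for one invocation without touching
# sys.stdout at runtime:
PYTHONIOENCODING=utf-8 python3 -c "import sys; print(sys.stdout.encoding)"
# cmd.exe equivalent:
#   set PYTHONIOENCODING=utf-8
#   python script.py
```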


    Victor
  • Robin Becker at Nov 18, 2013 at 4:08 pm

    On 18/11/2013 15:25, Victor Stinner wrote:
    Why do you need to force the UTF-8 encoding? Your locale is not
    correctly configured?

    It's better to set PYTHONIOENCODING rather than replacing
    sys.stdout/stderr at runtime.

    There is an open issue to add a TextIOWrapper.set_encoding() method:
    http://bugs.python.org/issue15216

    Victor
    well reportlab does all sorts of character sets and languages; if I put in a
    quick print to try and debug stuff I prefer that it create some output rather
    than create an error of its own. In the real world it's not possible always to
    know what the output contains (especially in error cases) so having any
    restriction on the allowed textual outputs is a bit constraining.


    The utf8 encoding should allow any unicode to be properly encoded, rendering is
    another issue and I expect some garbage when things are going wrong.


    I think you are right and I should use PYTHONIOENCODING to set this up. In the
    codec writer approach I think it's harder to get interactive behaviour working
    properly (the output seems to be buffered differently). My attempts to make
    windows xp use code page 65001 everywhere have been fairly catastrophic eg
    non-booting :(
    --
    Robin Becker
  • Random832 at Nov 18, 2013 at 10:30 pm

    On Mon, Nov 18, 2013, at 7:33, Robin Becker wrote:
    UTF-8 stuff

    This doesn't really solve the issue I was referring to, which is that
    windows _console_ (i.e. not redirected file or pipe) I/O can only
    support unicode via wide character (UTF-16) I/O with a special function,
    not via using byte-based I/O with the normal write function.
  • Andrew Barnert at Nov 19, 2013 at 5:27 am
    From: "random832 at fastmail.us" <random832@fastmail.us>





    On Mon, Nov 18, 2013, at 7:33, Robin Becker wrote:
    UTF-8 stuff
    This doesn't really solve the issue I was referring to, which is that
    windows _console_ (i.e. not redirected file or pipe) I/O can only
    support unicode via wide character (UTF-16) I/O with a special function,
    not via using byte-based I/O with the normal write function.



    The problem is that Windows 16-bit I/O doesn't fit into the usual io module hierarchy. Not because it uses an encoding of UTF-16 (although anyone familiar with ReadConsoleW/WriteConsoleW from other languages may be a bit confused that Python's lowest-level wrappers around them deal in byte counts instead of WCHAR counts), but because you have to use HANDLEs instead of fds. So, there are going to be some compromises and some complexity.


    One possibility is to use as much of the io hierarchy as possible, but not try to make it flexible enough to be reusable for arbitrary HANDLEs: Add WindowsFileIO and WindowsConsoleIO classes that implement RawIOBase with a native HANDLE and ReadFile/WriteFile and ReadConsoleW/WriteConsoleW respectively. Both work in terms of bytes (which means WindowsConsoleIO.read has to //2 its argument, and write has to *2 the result). You also need a create_windows_io function that wraps a HANDLE by calling GetConsoleMode and constructing a WindowsConsoleIO or WindowsFileIO as appropriate, then creates a BufferedReader/Writer around that, then constructs a TextIOWrapper with UTF-16 or the default encoding around that. At startup, you just do that for the three GetStdHandle handles, and that's your stdin, stdout, and stderr.


    Besides not being reusable enough for people who want to wrap HANDLEs from other libraries or attach to new consoles from Python, it's not clear what fileno() should return. You could fake it and return the MSVCRT fds that correspond to the same files as the HANDLEs, but it's possible to end up with one redirected and not the other (e.g., if you detach the console), and I'm not sure what happens if you mix and match the two. A more "correct" solution would be to call _open_osfhandle on the HANDLE (and then keep track of the fact that os.close closes the HANDLE, or leave it up to the user to deal with bad handle errors?), but I'm not sure that's any better in practice. Also, should a console HANDLE use _O_WTEXT for its fd (in which case the user has to know that he has a _O_WTEXT handle even though there's no way to see that from Python), or not (in which case he's mixing 8-bit and 16-bit I/O on the same file)?


    It might be reasonable to just not expose fileno(); most code that wants the fileno() for stdin is just going to do something Unix-y that's not going to work anyway (select it, tcsetattr it, pass it over a socket to another file, ...).


    A different approach would be to reuse as _little_ of io as possible, instead of as much: Windows stdin/stdout/stderr could each be custom TextIOBase implementations that work straight on HANDLEs and don't even support buffer (or detach), much less fileno. That exposes even less functionality to users, of course. It also means we need a parallel implementation of all the buffering logic. (On the other hand, it also leaves the door open to expose some Windows functionality, like async ReadFileEx/WriteFileEx, in a way that would be very hard through the normal layers...)




    It shouldn't be too hard to write most of these via an extension module or ctypes to experiment with it. As long as you're careful not to mix winsys.stdout and sys.stdout (the module could even set sys.stdin, sys.stdout, sys.stderr=stdin, stdout, stderr at import time, or just del them, for a bit of protection), it should work.


    It might be worth implementing a few different designs to play with, and putting them through their paces with some modules and scripts that do different things with stdio (including running the scripts with cmd.exe redirected I/O and with subprocess PIPEs) to see which ones have problems or limitations that are hard to foresee in advance.


    If you have a design that you think sounds good, and are willing to experiment the hell out of it, and don't know how to get started but would be willing to debug and finish a mostly-written/almost-working implementation, I could slap something together with ctypes to get you started.
  • Terry Reedy at Nov 15, 2013 at 11:49 pm

    On 11/15/2013 6:28 AM, Robin Becker wrote:
    I'm trying to understand what's going on with this simple program

    if __name__=='__main__':
    print("repr=%s" % repr(u'\xc1'))
    print("%%r=%r" % u'\xc1')

    On my windows XP box this fails miserably if run directly at a terminal

    C:\tmp> \Python33\python.exe bang.py
    Traceback (most recent call last):
    File "bang.py", line 2, in <module>
    print("repr=%s" % repr(u'\xc1'))
    File "C:\Python33\lib\encodings\cp437.py", line 19, in encode
    return codecs.charmap_encode(input,self.errors,encoding_map)[0]
    UnicodeEncodeError: 'charmap' codec can't encode character '\xc1' in
    position 6: character maps to <undefined>

    If I run the program redirected into a file then no error occurs and the
    result looks like this

    C:\tmp>cat fff
    repr='?'
    %r='?'

    and if I run it into a pipe it works as though into a file.

    It seems that repr thinks it can render u'\xc1' directly which is a
    problem since print then seems to want to convert that to cp437 if
    directed into a terminal.

    I find the idea that print knows what it's printing to a bit dangerous,

    print() just calls file.write(s), where file defaults to sys.stdout, for
    each string fragment it creates. write(s) *has* to encode s to bytes
    according to some encoding, and it uses the encoding associated with the
    file when it was opened.
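    Terry's point can be reproduced without a Windows console by giving
    print() wrapped streams with explicit encodings: cp437 rejects 'Á'
    exactly as in the traceback, while UTF-8 accepts it.

```python
import io

# A cp437 stream behaves like the Windows console in the traceback ...
cp437 = io.TextIOWrapper(io.BytesIO(), encoding="cp437")
try:
    print("repr=%s" % repr("\xc1"), file=cp437)
    cp437.flush()
except UnicodeEncodeError as exc:
    print("cp437 stream:", exc)

# ... while a UTF-8 stream encodes the same text happily.
utf8 = io.TextIOWrapper(io.BytesIO(), encoding="utf-8", newline="\n")
print("repr=%s" % repr("\xc1"), file=utf8)
utf8.flush()
assert utf8.buffer.getvalue() == b"repr='\xc3\x81'\n"
```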

    but it's the repr behaviour that strikes me as bad.

    What is responsible for defining the repr function's 'printable'
    so that repr would give me say an Ascii rendering?

    That is not repr's job. Perhaps you are looking for
    >>> repr(u'\xc1')
    "'Á'"
    >>> ascii(u'\xc1')
    "'\\xc1'"
    The above is with Idle on Win7. It is *much* better than the
    intentionally crippled console for working with the BMP subset of unicode.


    --
    Terry Jan Reedy
