FAQ
In Python 2, open() opens the file in binary mode (e.g. file.readline()
returns a byte string). codecs.open() opens the file in binary mode by
default; you have to specify an encoding name to open it in text mode.

In Python 3, open() opens the file in text mode by default. (It only
opens the file in binary mode if the mode string contains "b".) The problem is
that open() uses the locale encoding if the encoding is not specified,
which is the case *by default*. The locale encoding can be:

- UTF-8 on Mac OS X, most Linux distributions
- ISO-8859-1 on some FreeBSD systems
- ANSI code page on Windows, e.g. cp1252 (close to ISO-8859-1) in
Western Europe, cp932 in Japan, ...
- ASCII if the locale is manually set to an empty string or to "C", or
if the environment is empty, or by default on some systems
- something different depending on the system and user configuration...

If you develop under Mac OS X or Linux, you may be surprised when you
run your program on Windows and it hits the first non-ASCII character.
You may not detect the problem if you only write text in English...
until someone writes the first letter with a diacritic.
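The pitfall above can be reproduced in a few lines (a minimal sketch; the
file name and the cp1252 round-trip are illustrative assumptions):

```python
import locale
import os
import tempfile

# The encoding Python 3 picks when open() gets no encoding= argument;
# its value depends on the platform and user configuration.
print(locale.getpreferredencoding(False))

# Write "café" with an explicit, portable encoding.
path = os.path.join(tempfile.mkdtemp(), "note.txt")
with open(path, "w", encoding="utf-8") as f:
    f.write("caf\u00e9")

# Read it back the way a Windows ANSI locale would: no error is raised,
# the text is just silently corrupted (mojibake).
with open(path, "r", encoding="cp1252") as f:
    print(f.read())  # 'cafÃ©'
```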



As discussed before on this list, I propose to set the default encoding
of open() to UTF-8 in Python 3.3, and add a warning in Python 3.2 if
open() is called without an explicit encoding and if the locale encoding
is not UTF-8. Using the warning, you will quickly notice the potential
problem (using Python 3.2.2 and -Werror) on Windows or by using a
different locale encoding (e.g. using LANG="C").

I expect a lot of warnings from the Python standard library, and as many
in third party modules and applications. So do you think that it is too
late to change that in Python 3.3? One argument for changing it directly
in Python 3.3 is that most users will not notice the change because
their locale encoding is already UTF-8.

An alternative is to:
- Python 3.2: use the locale encoding but emit a warning if the locale
encoding is not UTF-8
- Python 3.3: use UTF-8 and emit a warning if the locale encoding is
not UTF-8... or maybe always emit a warning?
- Python 3.4: use UTF-8 (but don't emit warnings anymore)

I don't think that Windows developers even know that they are writing
files into the ANSI code page. The MSDN documentation of
WideCharToMultiByte() warns developers that the ANSI code page is not
portable, even across Windows computers:

"The ANSI code pages can be different on different computers, or can be
changed for a single computer, leading to data corruption. For the most
consistent results, applications should use Unicode, such as UTF-8 or
UTF-16, instead of a specific code page, unless legacy standards or data
formats prevent the use of Unicode. If using Unicode is not possible,
applications should tag the data stream with the appropriate encoding
name when protocols allow it. HTML and XML files allow tagging, but text
files do not."

It will always be possible to use the ANSI code page with
encoding="mbcs" (which only works on Windows), or an explicit code page
(e.g. encoding="cp1252").

--

The two other (rejected?) options to improve open() are:

- raise an error if the encoding argument is not set: will break most
programs
- emit a warning if the encoding argument is not set

--

Should I convert this email into a PEP, or is it not required?

Victor


  • M.-A. Lemburg at Jun 28, 2011 at 2:02 pm

    Victor Stinner wrote:
    In Python 2, open() opens the file in binary mode (e.g. file.readline()
    returns a byte string). codecs.open() opens the file in binary mode by
    default; you have to specify an encoding name to open it in text mode.

    In Python 3, open() opens the file in text mode by default. (It only
    opens the file in binary mode if the mode string contains "b".) The problem is
    that open() uses the locale encoding if the encoding is not specified,
    which is the case *by default*. The locale encoding can be:

    - UTF-8 on Mac OS X, most Linux distributions
    - ISO-8859-1 on some FreeBSD systems
    - ANSI code page on Windows, e.g. cp1252 (close to ISO-8859-1) in
    Western Europe, cp932 in Japan, ...
    - ASCII if the locale is manually set to an empty string or to "C", or
    if the environment is empty, or by default on some systems
    - something different depending on the system and user configuration...

    If you develop under Mac OS X or Linux, you may be surprised when you
    run your program on Windows and it hits the first non-ASCII character.
    You may not detect the problem if you only write text in English...
    until someone writes the first letter with a diacritic.
    How about a more radical change: have open() in Py3 default to
    opening the file in binary mode, if no encoding is given (even
    if the mode doesn't include 'b') ?

    That'll make it compatible to the Py2 world again and avoid
    all the encoding guessing.

    Making such default encodings depend on the locale has already
    failed to work when we first introduced a default encoding in
    Py2, so I don't understand why we are repeating the same
    mistake again in Py3 (only in a different area).

    Note that in Py2, Unix applications often leave out the 'b'
    mode, since there's no difference between using it or not.
    Only on Windows, you'll see a difference.

    --
    Marc-Andre Lemburg
    eGenix.com

    Professional Python Services directly from the Source (#1, Jun 28 2011)
    Python/Zope Consulting and Support ... http://www.egenix.com/
    mxODBC.Zope.Database.Adapter ... http://zope.egenix.com/
    mxODBC, mxDateTime, mxTextTools ... http://python.egenix.com/
    ________________________________________________________________________

    ::: Try our new mxODBC.Connect Python Database Interface for free ! ::::


    eGenix.com Software, Skills and Services GmbH Pastor-Loeh-Str.48
    D-40764 Langenfeld, Germany. CEO Dipl.-Math. Marc-Andre Lemburg
    Registered at Amtsgericht Duesseldorf: HRB 46611
    http://www.egenix.com/company/contact/
  • Terry Reedy at Jun 28, 2011 at 2:36 pm

    On 6/28/2011 10:02 AM, M.-A. Lemburg wrote:

    How about a more radical change: have open() in Py3 default to
    opening the file in binary mode, if no encoding is given (even
    if the mode doesn't include 'b') ?

    That'll make it compatible to the Py2 world again
    I disagree. I believe
    S = open('myfile.txt').read()
    now returns a text string in both Py2 and Py3 and a subsequent
    'abc' in S
    works in both.
    and avoid all the encoding guessing.
    Making such default encodings depend on the locale has already
    failed to work when we first introduced a default encoding in
    Py2, so I don't understand why we are repeating the same
    mistake again in Py3 (only in a different area).
    I do not remember any proposed change during the Py3 design discussions.
    Note that in Py2, Unix applications often leave out the 'b'
    mode, since there's no difference between using it or not.
    I believe it makes a difference now as to whether one gets str or bytes.
    Only on Windows, you'll see a difference.
    I believe the only difference now on Windows is the decoding used, not
    the return type.
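    Terry's distinction can be checked directly (a minimal sketch with a
    hypothetical file name):

```python
import os
import tempfile

# In Python 3 the mode string alone decides whether read() returns
# str or bytes, on every platform.
path = os.path.join(tempfile.mkdtemp(), "demo.txt")
with open(path, "w", encoding="utf-8") as f:
    f.write("abc")

with open(path, "r", encoding="utf-8") as f:
    text = f.read()       # text mode -> str
with open(path, "rb") as f:
    data = f.read()       # binary mode -> bytes

assert isinstance(text, str)
assert isinstance(data, bytes)
```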

    --
    Terry Jan Reedy
  • Michael Foord at Jun 28, 2011 at 2:48 pm

    On 28/06/2011 15:36, Terry Reedy wrote:
    On 6/28/2011 10:02 AM, M.-A. Lemburg wrote:

    How about a more radical change: have open() in Py3 default to
    opening the file in binary mode, if no encoding is given (even
    if the mode doesn't include 'b') ?

    That'll make it compatible to the Py2 world again
    I disagree. I believe
    S = open('myfile.txt').read()
    now returns a text string in both Py2 and Py3 and a subsequent
    'abc' in S
    works in both.
    Nope, it returns a bytestring in Python 2. Mistakenly treating
    bytestrings as text is one of the things we aimed to correct in the
    transition to Python 3.

    Michael
    and avoid all the encoding guessing.
    Making such default encodings depend on the locale has already
    failed to work when we first introduced a default encoding in
    Py2, so I don't understand why we are repeating the same
    mistake again in Py3 (only in a different area).
    I do not remember any proposed change during the Py3 design discussions.
    Note that in Py2, Unix applications often leave out the 'b'
    mode, since there's no difference between using it or not.
    I believe it makes a difference now as to whether one gets str or bytes.
    Only on Windows, you'll see a difference.
    I believe the only difference now on Windows is the decoding used, not
    the return type.

    --
    http://www.voidspace.org.uk/

    May you do good and not evil
    May you find forgiveness for yourself and forgive others
    May you share freely, never taking more than you give.
    -- the sqlite blessing http://www.sqlite.org/different.html
  • Terry Reedy at Jun 28, 2011 at 4:34 pm

    On 6/28/2011 10:48 AM, Michael Foord wrote:
    On 28/06/2011 15:36, Terry Reedy wrote:

    S = open('myfile.txt').read()
    now returns a text string in both Py2 and Py3 and a subsequent
    'abc' in S
    works in both.
    Nope, it returns a bytestring in Python 2.
    Which, in Py2 is a str() object. In both Pythons, .read() in default
    mode returns an object of type str() and 'abc' is an object of type
    str() and so expressions involving undecorated string literals and input
    just work, but would not work if input defaulted to bytes in Py 3. Sorry
    if I was not clear enough.

    --
    Terry Jan Reedy
  • Michael Foord at Jun 28, 2011 at 4:50 pm

    On 28/06/2011 17:34, Terry Reedy wrote:
    On 6/28/2011 10:48 AM, Michael Foord wrote:
    On 28/06/2011 15:36, Terry Reedy wrote:

    S = open('myfile.txt').read()
    now returns a text string in both Py2 and Py3 and a subsequent
    'abc' in S
    works in both.
    Nope, it returns a bytestring in Python 2.
    Which, in Py2 is a str() object.
    Yes, but not a "text string". The equivalent of the Python 2 str in
    Python 3 is bytes. Irrelevant discussion anyway.
    In both Pythons, .read() in default mode returns an object of type
    str() and 'abc' is an object of type str() and so expressions
    involving undecorated string literals and input just work, but would
    not work if input defaulted to bytes in Py 3. Sorry if I was not clear
    enough.
    Well, I think you're both right. Both semantics break some assumption or
    other.

    All the best,

    Michael

    --
    http://www.voidspace.org.uk/

    May you do good and not evil
    May you find forgiveness for yourself and forgive others
    May you share freely, never taking more than you give.
    -- the sqlite blessing http://www.sqlite.org/different.html
  • Ethan Furman at Jun 28, 2011 at 6:55 pm

    Michael Foord wrote:
    On 28/06/2011 17:34, Terry Reedy wrote:
    On 6/28/2011 10:48 AM, Michael Foord wrote:
    On 28/06/2011 15:36, Terry Reedy wrote:

    S = open('myfile.txt').read()
    now returns a text string in both Py2 and Py3 and a subsequent
    'abc' in S
    works in both.
    Nope, it returns a bytestring in Python 2.
    Which, in Py2 is a str() object.
    Yes, but not a "text string". The equivalent of the Python 2 str in
    Python 3 is bytes. Irrelevant discussion anyway.
    Irrelevant to the OP, yes, but a Python 2 string *is not* the same as
    Python 3 bytes. If you don't believe me fire up your Python 3 shell and
    try b'xyz'[1] == 'y'.

    ~Ethan~
  • Ethan Furman at Jun 28, 2011 at 7:06 pm

    Ethan Furman wrote:
    Michael Foord wrote:
    On 28/06/2011 17:34, Terry Reedy wrote:
    On 6/28/2011 10:48 AM, Michael Foord wrote:
    On 28/06/2011 15:36, Terry Reedy wrote:

    S = open('myfile.txt').read()
    now returns a text string in both Py2 and Py3 and a subsequent
    'abc' in S
    works in both.
    Nope, it returns a bytestring in Python 2.
    Which, in Py2 is a str() object.
    Yes, but not a "text string". The equivalent of the Python 2 str in
    Python 3 is bytes. Irrelevant discussion anyway.
    Irrelevant to the OP, yes, but a Python 2 string *is not* the same as
    Python 3 bytes. If you don't believe me fire up your Python 3 shell and
    try b'xyz'[1] == 'y'.
    er, make that b'xyz'[1] == b'y' :(
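    Ethan's example is worth spelling out: in Python 3, indexing bytes
    yields an int, so neither comparison is true (a minimal sketch):

```python
# Indexing bytes in Python 3 yields an int, not a 1-byte bytes object.
item = b'xyz'[1]
assert item == 121            # ord('y')
assert item != 'y'            # int vs str: never equal
assert item != b'y'           # int vs bytes: never equal

# Slicing, by contrast, does yield bytes.
assert b'xyz'[1:2] == b'y'
```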
  • Bill Janssen at Jun 28, 2011 at 4:53 pm

    Terry Reedy wrote:

    Making such default encodings depend on the locale has already
    failed to work when we first introduced a default encoding in
    Py2, so I don't understand why we are repeating the same
    mistake again in Py3 (only in a different area).
    I do not remember any proposed change during the Py3 design discussions.
    I certainly proposed it, more than once.

    Bill
  • Bill Janssen at Jun 28, 2011 at 4:52 pm

    M.-A. Lemburg wrote:

    How about a more radical change: have open() in Py3 default to
    opening the file in binary mode, if no encoding is given (even
    if the mode doesn't include 'b') ?
    +1.
    That'll make it compatible to the Py2 world again and avoid
    all the encoding guessing.
    Yep.

    Bill
  • Tres Seaver at Jun 28, 2011 at 10:29 pm

    On 06/28/2011 12:52 PM, Bill Janssen wrote:
    M.-A. Lemburg wrote:
    How about a more radical change: have open() in Py3 default to
    opening the file in binary mode, if no encoding is given (even
    if the mode doesn't include 'b') ?
    +1.
    That'll make it compatible to the Py2 world again and avoid
    all the encoding guessing.
    Yep.
    +1 from me, as well: "in the face of ambiguity, refuse the temptation
    to guess."


    Tres.
    --
    ===================================================================
    Tres Seaver +1 540-429-0999 tseaver at palladion.com
    Palladion Software "Excellence by Design" http://palladion.com
  • Victor Stinner at Jun 28, 2011 at 8:51 pm

    On Tuesday, 28 June 2011 at 16:02 +0200, M.-A. Lemburg wrote:
    How about a more radical change: have open() in Py3 default to
    opening the file in binary mode, if no encoding is given (even
    if the mode doesn't include 'b') ?
    I tried your suggested change: Python doesn't start.

    sysconfig uses the implicit locale encoding to read sysconfig.cfg, the
    Makefile and pyconfig.h. I think that it is correct to use the locale
    encoding for Makefile and pyconfig.h, but maybe not for sysconfig.cfg.

    Python requires more changes just to run "make". I was able to run "make"
    by using encoding='utf-8' in various functions (of distutils and
    setup.py). I didn't try the test suite; I expect too many failures.

    --

    Then I tried my suggestion (use "utf-8" by default): Python starts
    correctly, I can build it (run "make") and... the full test suite passes
    without any change. (I'm testing on Linux; my locale encoding is UTF-8.)

    Victor
  • M.-A. Lemburg at Jun 29, 2011 at 8:18 am

    Victor Stinner wrote:
    On Tuesday, 28 June 2011 at 16:02 +0200, M.-A. Lemburg wrote:
    How about a more radical change: have open() in Py3 default to
    opening the file in binary mode, if no encoding is given (even
    if the mode doesn't include 'b') ?
    I tried your suggested change: Python doesn't start.
    No surprise there: it's an incompatible change, but one that undoes
    a wart introduced in the Py3 transition. Guessing encodings should
    be avoided whenever possible.
    sysconfig uses the implicit locale encoding to read sysconfig.cfg, the
    Makefile and pyconfig.h. I think that it is correct to use the locale
    encoding for Makefile and pyconfig.h, but maybe not for sysconfig.cfg.

    Python requires more changes just to run "make". I was able to run "make"
    by using encoding='utf-8' in various functions (of distutils and
    setup.py). I didn't try the test suite, I expect too many failures.
    This demonstrates that Python's stdlib is still not being explicit
    about the encoding issues. I suppose that things just happen to work
    because we mostly use ASCII files for configuration and setup.
    --

    Then I tried my suggestion (use "utf-8" by default): Python starts
    correctly, I can build it (run "make") and... the full test suite passes
    without any change. (I'm testing on Linux, my locale encoding is UTF-8.)
    I bet it would also work with "ascii" in most cases. Which then just
    means that the Python build process and test suite are not a good
    test case for choosing a default encoding.

    Linux is also a poor test candidate for this, since most user setups
    will use UTF-8 as locale encoding. Windows, OTOH, uses all sorts of
    code page encodings (usually not UTF-8), so you are likely to hit
    the real problem cases a lot easier.

    --
    Marc-Andre Lemburg
    eGenix.com

    Professional Python Services directly from the Source (#1, Jun 29 2011)
    Python/Zope Consulting and Support ... http://www.egenix.com/
    mxODBC.Zope.Database.Adapter ... http://zope.egenix.com/
    mxODBC, mxDateTime, mxTextTools ... http://python.egenix.com/
    ________________________________________________________________________

    ::: Try our new mxODBC.Connect Python Database Interface for free ! ::::


    eGenix.com Software, Skills and Services GmbH Pastor-Loeh-Str.48
    D-40764 Langenfeld, Germany. CEO Dipl.-Math. Marc-Andre Lemburg
    Registered at Amtsgericht Duesseldorf: HRB 46611
    http://www.egenix.com/company/contact/
  • Victor Stinner at Jun 29, 2011 at 9:50 am

    On Wednesday, 29 June 2011 at 10:18 +0200, M.-A. Lemburg wrote:
    Victor Stinner wrote:
    On Tuesday, 28 June 2011 at 16:02 +0200, M.-A. Lemburg wrote:
    How about a more radical change: have open() in Py3 default to
    opening the file in binary mode, if no encoding is given (even
    if the mode doesn't include 'b') ?
    I tried your suggested change: Python doesn't start.
    No surprise there: it's an incompatible change, but one that undoes
    a wart introduced in the Py3 transition. Guessing encodings should
    be avoided whenever possible.
    It means that all programs written for Python 3.0, 3.1, 3.2 will stop
    working with the new 3.x version (let's say 3.3). Users will have to
    migrate from Python 2 to Python 3.2, and then migrate from Python 3.2
    to Python 3.3 :-(

    I would prefer a ResourceWarning (emitted if the encoding is not
    specified), hidden by default: it doesn't break compatibility, and
    -Werror gives exactly the same behaviour that you expect.
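    The warning Victor describes can be prototyped with a small wrapper
    around open() (a hypothetical helper, not an actual CPython patch; the
    name open_checked is made up):

```python
import warnings

def open_checked(file, mode="r", encoding=None, **kwargs):
    """Hypothetical wrapper sketching the proposal: warn whenever a
    text-mode open() silently relies on the locale encoding."""
    if "b" not in mode and encoding is None:
        warnings.warn(
            "open() without an explicit encoding uses the locale encoding",
            UserWarning, stacklevel=2)
    return open(file, mode, encoding=encoding, **kwargs)
```

    Running such code under -Werror then turns every implicit-encoding
    open into an error, which is the behaviour described above.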
    This demonstrates that Python's stdlib is still not being explicit
    about the encoding issues. I suppose that things just happen to work
    because we mostly use ASCII files for configuration and setup.
    I did more tests. I found some mistakes and sometimes the binary mode
    can be used, but most functions really expect the locale encoding (it is
    the correct encoding to read and write files). I agree that it would be
    good to have an explicit encoding="locale", but making it mandatory is a
    little bit rude.
    Then I tried my suggestion (use "utf-8" by default): Python starts
    correctly, I can build it (run "make") and... the full test suite passes
    without any change. (I'm testing on Linux, my locale encoding is UTF-8.)
    I bet it would also work with "ascii" in most cases. Which then just
    means that the Python build process and test suite is not a good
    test case for choosing a default encoding.

    Linux is also a poor test candidate for this, since most user setups
    will use UTF-8 as locale encoding. Windows, OTOH, uses all sorts of
    code page encodings (usually not UTF-8), so you are likely to hit
    the real problem cases a lot easier.
    I also ran the test suite on my patched Python (open uses UTF-8 by
    default) with an ASCII locale encoding (LANG=C); the test suite also
    passes. Many tests use non-ASCII characters; some of them are skipped if
    the locale encoding is unable to encode the tested text.

    Victor
  • M.-A. Lemburg at Jun 29, 2011 at 10:20 am

    Victor Stinner wrote:
    On Wednesday, 29 June 2011 at 10:18 +0200, M.-A. Lemburg wrote:
    Victor Stinner wrote:
    On Tuesday, 28 June 2011 at 16:02 +0200, M.-A. Lemburg wrote:
    How about a more radical change: have open() in Py3 default to
    opening the file in binary mode, if no encoding is given (even
    if the mode doesn't include 'b') ?
    I tried your suggested change: Python doesn't start.
    No surprise there: it's an incompatible change, but one that undoes
    a wart introduced in the Py3 transition. Guessing encodings should
    be avoided whenever possible.
    It means that all programs written for Python 3.0, 3.1, 3.2 will stop
    working with the new 3.x version (let's say 3.3). Users will have to
    migrate from Python 2 to Python 3.2, and then migrate from Python 3.2
    to Python 3.3 :-(
    I wasn't suggesting doing this for 3.3, but we may want to start
    the usual feature change process to make the change eventually
    happen.
    I would prefer a ResourceWarning (emitted if the encoding is not
    specified), hidden by default: it doesn't break compatibility, and
    -Werror gives exactly the same behaviour that you expect.
    ResourceWarning is the wrong type of warning for this. I'd
    suggest to use a UnicodeWarning or perhaps create a new
    EncodingWarning instead.
    This demonstrates that Python's stdlib is still not being explicit
    about the encoding issues. I suppose that things just happen to work
    because we mostly use ASCII files for configuration and setup.
    I did more tests. I found some mistakes and sometimes the binary mode
    can be used, but most functions really expect the locale encoding (it is
    the correct encoding to read and write files). I agree that it would be
    good to have an explicit encoding="locale", but making it mandatory is a
    little bit rude.
    Again: Using a locale based default encoding will not work out
    in the long run. We've had those discussions many times in the
    past.

    I don't think there's anything bad with requiring the user to set an
    encoding if he wants to read text. It makes him/her think twice about
    the encoding issue, which is good.

    And, of course, the stdlib should start using this
    explicit-is-better-than-implicit approach as well.
    Then I tried my suggestion (use "utf-8" by default): Python starts
    correctly, I can build it (run "make") and... the full test suite passes
    without any change. (I'm testing on Linux, my locale encoding is UTF-8.)
    I bet it would also work with "ascii" in most cases. Which then just
    means that the Python build process and test suite is not a good
    test case for choosing a default encoding.

    Linux is also a poor test candidate for this, since most user setups
    will use UTF-8 as locale encoding. Windows, OTOH, uses all sorts of
    code page encodings (usually not UTF-8), so you are likely to hit
    the real problem cases a lot easier.
    I also ran the test suite on my patched Python (open uses UTF-8 by
    default) with an ASCII locale encoding (LANG=C); the test suite also
    passes. Many tests use non-ASCII characters; some of them are skipped if
    the locale encoding is unable to encode the tested text.
    Thanks for checking. So the build process and test suite are
    indeed not suitable test cases for the problem at hand. With
    just ASCII files to decode, Python will simply never fail
    to decode the content, regardless of whether you use an ASCII,
    UTF-8 or some Windows code page as locale encoding.
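    MAL's point is easy to verify: ASCII bytes decode to the same text
    under every encoding mentioned in this thread, so an ASCII-only corpus
    cannot distinguish default encodings (a minimal sketch):

```python
data = b"plain ASCII configuration\n"

# All of these encodings are ASCII-compatible, so the decoded text is
# identical; only non-ASCII bytes would reveal a difference.
decoded = {data.decode(enc) for enc in ("ascii", "utf-8", "cp1252", "latin-1")}
assert len(decoded) == 1
```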

    --
    Marc-Andre Lemburg
    eGenix.com

    Professional Python Services directly from the Source (#1, Jun 29 2011)
    Python/Zope Consulting and Support ... http://www.egenix.com/
    mxODBC.Zope.Database.Adapter ... http://zope.egenix.com/
    mxODBC, mxDateTime, mxTextTools ... http://python.egenix.com/
    ________________________________________________________________________

    ::: Try our new mxODBC.Connect Python Database Interface for free ! ::::


    eGenix.com Software, Skills and Services GmbH Pastor-Loeh-Str.48
    D-40764 Langenfeld, Germany. CEO Dipl.-Math. Marc-Andre Lemburg
    Registered at Amtsgericht Duesseldorf: HRB 46611
    http://www.egenix.com/company/contact/
  • Antoine Pitrou at Jun 28, 2011 at 2:06 pm

    On Tue, 28 Jun 2011 15:43:05 +0200 Victor Stinner wrote:
    - ISO-8859-1 on some FreeBSD systems
    - ANSI code page on Windows, e.g. cp1252 (close to ISO-8859-1) in
    Western Europe, cp932 in Japan, ...
    - ASCII if the locale is manually set to an empty string or to "C", or
    if the environment is empty, or by default on some systems
    - something different depending on the system and user configuration...
    Why would utf-8 be the right thing in these cases?

    Regards

    Antoine.
  • Terry Reedy at Jun 28, 2011 at 2:41 pm

    On 6/28/2011 10:06 AM, Antoine Pitrou wrote:
    On Tue, 28 Jun 2011 15:43:05 +0200
    Victor Stinner wrote:
    - ISO-8859-1 on some FreeBSD systems
    - ANSI code page on Windows, e.g. cp1252 (close to ISO-8859-1) in
    Western Europe, cp932 in Japan, ...
    - ASCII if the locale is manually set to an empty string or to "C", or
    if the environment is empty, or by default on some systems
    - something different depending on the system and user configuration...
    Why would utf-8 be the right thing in these cases?
    Because utf-8 is the only way to write out any Python 3 text.
    By default, writing and reading an str object should work on all Python
    installations.
    And because other apps are (increasingly) using it for exactly the same
    reason.


    --
    Terry Jan Reedy
  • Antoine Pitrou at Jun 28, 2011 at 3:02 pm

    On Tue, 28 Jun 2011 10:41:38 -0400 Terry Reedy wrote:
    On 6/28/2011 10:06 AM, Antoine Pitrou wrote:
    On Tue, 28 Jun 2011 15:43:05 +0200
    Victor Stinner wrote:
    - ISO-8859-1 on some FreeBSD systems
    - ANSI code page on Windows, e.g. cp1252 (close to ISO-8859-1) in
    Western Europe, cp932 in Japan, ...
    - ASCII if the locale is manually set to an empty string or to "C", or
    if the environment is empty, or by default on some systems
    - something different depending on the system and user configuration...
    Why would utf-8 be the right thing in these cases?
    Because utf-8 is the only way to write out any Python 3 text.
    Er, no, you also have utf-16, utf-32, utf-7 (and possibly others,
    including home-baked encodings).
    By default, writing and reading an str object should work on all Python
    installations.
    But that's only half of the problem. If the text is supposed to be read
    or processed by some other program, then writing it in some encoding
    that the other program doesn't expect doesn't really help. That's why
    we use the locale encoding: because it's a good guess as to what the
    system (and its users) expects text to be encoded in.

    Regards

    Antoine.
  • Terry Reedy at Jun 28, 2011 at 2:24 pm

    On 6/28/2011 9:43 AM, Victor Stinner wrote:
    In Python 2, open() opens the file in binary mode (e.g. file.readline()
    returns a byte string). codecs.open() opens the file in binary mode by
    default; you have to specify an encoding name to open it in text mode.

    In Python 3, open() opens the file in text mode by default. (It only
    opens the file in binary mode if the mode string contains "b".) The problem is
    that open() uses the locale encoding if the encoding is not specified,
    which is the case *by default*. The locale encoding can be:

    - UTF-8 on Mac OS X, most Linux distributions
    - ISO-8859-1 on some FreeBSD systems
    - ANSI code page on Windows, e.g. cp1252 (close to ISO-8859-1) in
    Western Europe, cp932 in Japan, ...
    - ASCII if the locale is manually set to an empty string or to "C", or
    if the environment is empty, or by default on some systems
    - something different depending on the system and user configuration...

    If you develop under Mac OS X or Linux, you may be surprised when you
    run your program on Windows and it hits the first non-ASCII character.
    You may not detect the problem if you only write text in English...
    until someone writes the first letter with a diacritic.



    As discussed before on this list, I propose to set the default encoding
    of open() to UTF-8 in Python 3.3, and add a warning in Python 3.2 if
    open() is called without an explicit encoding and if the locale encoding
    is not UTF-8. Using the warning, you will quickly notice the potential
    problem (using Python 3.2.2 and -Werror) on Windows or by using a
    different locale encoding (e.g. using LANG="C").

    I expect a lot of warnings from the Python standard library, and as many
    in third party modules and applications. So do you think that it is too
    late to change that in Python 3.3? One argument for changing it directly
    in Python 3.3 is that most users will not notice the change because
    their locale encoding is already UTF-8.

    An alternative is to:
    - Python 3.2: use the locale encoding but emit a warning if the locale
    encoding is not UTF-8
    - Python 3.3: use UTF-8 and emit a warning if the locale encoding is
    not UTF-8... or maybe always emit a warning?
    - Python 3.4: use UTF-8 (but don't emit warnings anymore)

    I don't think that Windows developers even know that they are writing
    files into the ANSI code page. The MSDN documentation of
    WideCharToMultiByte() warns developers that the ANSI code page is not
    portable, even across Windows computers:

    "The ANSI code pages can be different on different computers, or can be
    changed for a single computer, leading to data corruption. For the most
    consistent results, applications should use Unicode, such as UTF-8 or
    UTF-16, instead of a specific code page, unless legacy standards or data
    formats prevent the use of Unicode. If using Unicode is not possible,
    applications should tag the data stream with the appropriate encoding
    name when protocols allow it. HTML and XML files allow tagging, but text
    files do not."

    It will always be possible to use the ANSI code page with
    encoding="mbcs" (which only works on Windows), or an explicit code page
    (e.g. encoding="cp1252").

    --

    The two other (rejected?) options to improve open() are:

    - raise an error if the encoding argument is not set: will break most
    programs
    - emit a warning if the encoding argument is not set

    --

    Should I convert this email into a PEP, or is it not required?
    I think a PEP is needed.

    --
    Terry Jan Reedy
  • Georg Brandl at Jun 28, 2011 at 9:42 pm

    On 28.06.2011 14:24, Terry Reedy wrote:

    As discussed before on this list, I propose to set the default encoding
    of open() to UTF-8 in Python 3.3, and add a warning in Python 3.2 if
    open() is called without an explicit encoding and if the locale encoding
    is not UTF-8. Using the warning, you will quickly notice the potential
    problem (using Python 3.2.2 and -Werror) on Windows or by using a
    different locale encoding (e.g. using LANG="C").
    [...]
    Should I convert this email into a PEP, or is it not required?
    I think a PEP is needed.
    Absolutely. And I hope the hypothetical PEP would be rejected in this form.

    We need to stop making incompatible changes to Python 3. We had the chance
    and took it to break all kinds of stuff, some of it gratuitous, with 3.0 and
    even 3.1. Now the users need a period of compatibility and stability (just
    like the language moratorium provided for one aspect of Python).

    Think about porting: Python 3 uptake is not ahead of schedule (I don't
    want to say it's too slow, but it's certainly not too fast). For the sake
    of porters' sanity, 3.x should not be a moving target. New features are
    not so much of a problem, but incompatibilities like this one certainly
    are.

    At the very least, a change like this needs a transitional strategy, like
    it has been used during the 2.x series:

    * In 3.3, accept "locale" as the encoding parameter, meaning the locale encoding
    * In 3.4, warn if encoding isn't given and the locale encoding isn't UTF-8
    * In 3.5, change default encoding to UTF-8

    It might be just enough to stress in the documentation that usage of the
    encoding parameter is recommended for cross-platform consistency.
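    What the proposed "locale" spelling would mean can already be emulated
    by naming the locale encoding explicitly; a hedged sketch (file name
    invented for illustration):

```python
# Sketch of the proposed encoding="locale" spelling: pass the locale's
# preferred encoding explicitly, so the platform-dependent choice is at
# least visible at the call site instead of being implicit.
import locale
import os
import tempfile

enc = locale.getpreferredencoding(False)  # what open() uses implicitly

path = os.path.join(tempfile.mkdtemp(), "notes.txt")
with open(path, "w", encoding=enc) as f:
    f.write("hello")

with open(path, encoding=enc) as f:
    assert f.read() == "hello"
```

    The behaviour is unchanged from today's default; the difference is that
    the dependence on the locale is spelled out where the file is opened.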

    cheers,
    Georg
  • Terry Reedy at Jun 29, 2011 at 12:22 am

    On 6/28/2011 5:42 PM, Georg Brandl wrote:

    At the very least, a change like this needs a transitional strategy, like
    it has been used during the 2.x series:

    * In 3.3, accept "locale" as the encoding parameter, meaning the locale encoding
    * In 3.4, warn if encoding isn't given and the locale encoding isn't UTF-8
    * In 3.5, change default encoding to UTF-8
    3.5 should be 4-5 years off. I actually would not propose anything
    faster than that.

    --
    Terry Jan Reedy
  • Nick Coghlan at Jun 29, 2011 at 6:52 am

    On Wed, Jun 29, 2011 at 7:42 AM, Georg Brandl wrote:
    On 28.06.2011 14:24, Terry Reedy wrote:
    I think a PEP is needed.
    Absolutely. And I hope the hypothetical PEP would be rejected in this form.

    We need to stop making incompatible changes to Python 3. We had the chance
    and took it to break all kinds of stuff, some of it gratuitous, with 3.0 and
    even 3.1. Now the users need a period of compatibility and stability (just
    like the language moratorium provided for one aspect of Python).
    +1 to everything Georg said.

    - nothing can change in 3.2
    - perhaps provide a way for an application to switch the default
    behaviour between 'locale' and 'utf-8' in 3.3
    - if this is done, also provide a way to explicitly request the
    'locale' behaviour (likely via a locale dependent codec alias)
    - maybe start thinking about an actual transition to 'utf-8' as
    default in the 3.4/5 time frame

    Cheers,
    Nick.

    --
    Nick Coghlan | ncoghlan at gmail.com | Brisbane, Australia
  • Barry Warsaw at Jun 29, 2011 at 7:58 am

    On Jun 28, 2011, at 09:42 PM, Georg Brandl wrote:
    We need to stop making incompatible changes to Python 3. We had the chance
    and took it to break all kinds of stuff, some of it gratuitous, with 3.0 and
    even 3.1. Now the users need a period of compatibility and stability (just
    like the language moratorium provided for one aspect of Python).
    +1. I think this is the #1 complaint I hear about Python in talking to
    users. I think in general we do a pretty good job of maintaining backward
    compatibility between releases, but not a perfect job, and the places where we
    miss can be painful for folks. It may be difficult to achieve in all cases,
    but compatibility should be carefully and thoroughly considered for all
    changes, especially in the stdlib, and clearly documented where deliberate
    decisions to break that are adopted.

    -Barry
  • Paul Moore at Jun 28, 2011 at 2:46 pm

    On 28 June 2011 14:43, Victor Stinner wrote:
    As discussed before on this list, I propose to set the default encoding
    of open() to UTF-8 in Python 3.3, and add a warning in Python 3.2 if
    open() is called without an explicit encoding and if the locale encoding
    is not UTF-8. Using the warning, you will quickly notice the potential
    problem (using Python 3.2.2 and -Werror) on Windows or by using a
    different locale encoding (e.g. using LANG="C").
    -1. This will make things harder for simple scripts which are not
    intended to be cross-platform.

    I use Windows, and come from the UK, so 99% of my text files are
    ASCII. So the majority of my code will be unaffected. But in the
    occasional situation where I use a £ sign, I'll get encoding errors,
    where currently things will "just work". And the failures will be data
    dependent, and hence intermittent (the worst type of problem). I'll
    write a quick script, use it once and it'll be fine, then use it later
    on some different data and get an error. :-(

    I appreciate that the point here is to make sure that people think a
    bit more carefully about encoding issues. But doing so by making
    Python less friendly for casual, ad hoc script use seems to me to be a
    mistake.
    I don't think that Windows developers even know that they are writing
    files into the ANSI code page. MSDN documentation of
    WideCharToMultiByte() warns developers that the ANSI code page is not
    portable, even across Windows computers:
    Probably true. But for many uses they also don't care. If you're
    writing something solely for a one-off job on your own PC, the ANSI
    code page is fine, and provides interoperability with other programs
    on your PC, which is really what you care about. (UTF-8 without BOM
    displays incorrectly in Vim, WordPad, and PowerShell Get-Content. MBCS
    works fine in all of these. It also displays incorrectly in CMD's type
    command, but in a less familiar form than the incorrect display MBCS
    produces, for what that's worth...)
    It will always be possible to use the ANSI code page with
    encoding="mbcs" (which only works on Windows), or an explicit code page
    name (e.g. encoding="cp1252").
    So, in effect, you propose making the default favour writing
    multiplatform portable code at the expense of quick and dirty scripts?
    My personal view is that this is the wrong choice ("practicality beats
    purity") but I guess it's ultimately a question of Python's design
    philosophy.
    The two other (rejected?) options to improve open() are:

    - raise an error if the encoding argument is not set: will break most
    programs
    - emit a warning if the encoding argument is not set
    IMHO, you missed another option - open() does not need improving, the
    current behaviour is better than any of the 3 options noted.

    Paul.
  • Steffen Daode Nurpmeso at Jun 28, 2011 at 3:06 pm

    @ Paul Moore <p.f.moore at gmail.com> wrote (2011-06-28 16:46+0200):
    UTF-8 without BOM displays incorrectly in vim(1)
    Stop right now (you're oh so wrong)! :-)

    (By the way: UTF-8 and BOM?
    Interesting things i learn on this list.
    And i hope in ten years we can laugh about this -> UTF-8
    transition all over the place, 'cause it's simply working.)

    --
    Ciao, Steffen
    sdaoden(*)(gmail.com)
    () ascii ribbon campaign - against html e-mail
    /\ www.asciiribbon.org - against proprietary attachments
  • Paul Moore at Jun 28, 2011 at 3:44 pm

    On 28 June 2011 16:06, Steffen Daode Nurpmeso wrote:
    @ Paul Moore <p.f.moore at gmail.com> wrote (2011-06-28 16:46+0200):
    UTF-8 without BOM displays incorrectly in vim(1)
    Stop right now (you're oh so wrong)! :-)
    Sorry. Please add "using the default settings of gvim on Windows". My
    context throughout was Windows not Unix. Sorry I didn't make that
    clear.
    (By the way: UTF-8 and BOM?
    Windows uses it, I believe. My tests specifically used files with no
    BOM, just UTF-8-encoded text. I made this statement to head off people
    assuming that UTF-8 can be detected on Windows by looking at the first
    few bytes.
    Interesting things i learn on this list. :-)
    And i hope in ten years we can laugh about this -> UTF-8
    transition all over the place, 'cause it's simply working.)
    That would be good...

    Paul.
  • Toshio Kuratomi at Jun 28, 2011 at 4:33 pm

    On Tue, Jun 28, 2011 at 03:46:12PM +0100, Paul Moore wrote:
    On 28 June 2011 14:43, Victor Stinner wrote:
    As discussed before on this list, I propose to set the default encoding
    of open() to UTF-8 in Python 3.3, and add a warning in Python 3.2 if
    open() is called without an explicit encoding and if the locale encoding
    is not UTF-8. Using the warning, you will quickly notice the potential
    problem (using Python 3.2.2 and -Werror) on Windows or by using a
    different locale encoding (e.g. using LANG="C").
    -1. This will make things harder for simple scripts which are not
    intended to be cross-platform.

    I use Windows, and come from the UK, so 99% of my text files are
    ASCII. So the majority of my code will be unaffected. But in the
    occasional situation where I use a £ sign, I'll get encoding errors,
    where currently things will "just work". And the failures will be data
    dependent, and hence intermittent (the worst type of problem). I'll
    write a quick script, use it once and it'll be fine, then use it later
    on some different data and get an error. :-(
    I don't think this change would make things "harder". It will just move
    where the pain occurs. Right now, the failures are intermittent, showing
    up either A) on computers other than the one you're using, or B) when the
    script is run by a different user. Sys admins where I'm at are
    constantly writing ad hoc scripts in Python that break because you stick
    something in a cron job, the locale settings suddenly become "C", and
    therefore the script suddenly only deals with ASCII characters.

    I don't know that Victor's proposed solution is the best (I personally would
    like it a whole lot more than the current guessing but I never develop on
    Windows so I can certainly see that your environment can lead to the
    opposite assumption :-) but something should change here. Issuing a warning
    like "open used without explicit encoding may lead to errors" if open() is
    used without an explicit encoding would help a little (at least, people who
    get errors would then have an inkling that the culprit might be an open()
    call). If I read Victor's previous email correctly, though, he said this
    was previously rejected.

    Another brainstorming solution would be to use different default encodings
    on different platforms. For instance, for writing files, UTF-8 on *nix
    systems (including Mac OS X) and UTF-16 on Windows. For reading files,
    check for a UTF-16 BOM; if not present, operate as UTF-8. That would seem
    to address your issue with detection by Vim, etc., but I'm not sure about
    getting "£" in your input stream. I don't know where your input is coming
    from and how the Windows equivalent of locale plays into that.
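    A hedged sketch of that reading rule (the helper names are invented for
    illustration): look for a UTF-16 BOM at the start of the file and
    otherwise assume UTF-8:

```python
# Sketch of the brainstormed reading rule: if the file starts with a
# UTF-16 BOM, decode as UTF-16 (Python's utf-16 codec consumes the BOM
# itself); otherwise fall back to UTF-8.
import codecs

def sniff_encoding(path):
    with open(path, "rb") as f:
        head = f.read(2)
    if head in (codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE):
        return "utf-16"
    return "utf-8"

def read_text(path):
    with open(path, encoding=sniff_encoding(path)) as f:
        return f.read()
```

    As the follow-ups note, BOM sniffing is contentious: a UTF-8 file can
    legitimately begin with the same bytes as text, so this is a heuristic,
    not a guarantee.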

    -Toshio
  • Victor Stinner at Jun 28, 2011 at 10:34 pm

    On Tuesday, June 28, 2011 at 09:33 -0700, Toshio Kuratomi wrote:
    Issuing a warning like "open used without explicit encoding may lead
    to errors" if open() is used without an explicit encoding would help
    a little (at least, people who get errors would then have an inkling
    that the culprit might be an open() call). If I read Victor's previous
    email correctly, though, he said this was previously rejected.
    Oh sorry, I used the wrong word. I listed two other possible solutions,
    but they were not really rejected. I just thought that changing the
    default encoding to UTF-8 was the most widely accepted idea.

    If I mix different suggestions together: another solution is to emit a
    warning if the encoding is not specified (not only if the locale
    encoding is different from UTF-8). Using encoding="locale" would make it
    quiet. It would be annoying if the warning were displayed by default
    ("This will make things harder for simple scripts which are not
    intended to be cross-platform," wrote Paul Moore). It only makes sense
    if we use the same policy as for unclosed files/sockets: hidden by
    default, but configurable using command line options (-Werror,
    yeah!).
    Another brainstorming solution would be to use different default encodings on
    different platforms. For instance, for writing files, utf-8 on *nix systems
    (including macosX) and utf-16 on windows.
    I don't think that UTF-16 is a better choice than UTF-8 on Windows :-(
    For reading files, check for a utf-16 BOM, if not present, operate as utf-8.
    Oh oh. I already suggested reading the BOM. See
    http://bugs.python.org/issue7651 and read the email thread "Improve
    open() to support reading file starting with an unicode BOM"
    http://mail.python.org/pipermail/python-dev/2010-January/097102.html

    Reading the BOM is a can of worms; everybody expects something different.
    I gave up on the idea of changing that.

    Victor
  • Terry Reedy at Jun 28, 2011 at 5:06 pm

    On 6/28/2011 10:46 AM, Paul Moore wrote:

    I use Windows, and come from the UK, so 99% of my text files are
    ASCII. So the majority of my code will be unaffected. But in the
    occasional situation where I use a £ sign, I'll get encoding errors,
    I do not understand this. With utf-8 you would never get a string
    encoding error.
    where currently things will "just work".
    As long as you only use the machine-dependent restricted character set.
    And the failures will be data dependent, and hence intermittent
    (the worst type of problem).
    That is the situation now, with platform/machine dependencies added in.
    Some people share code with other machines, even locally.
    So, in effect, you propose making the default favour writing
    multiplatform portable code at the expense of quick and dirty scripts?
    Let us frame it another way. Should Python installations be compatible
    with other Python installations, or with the other apps on the same
    machine? Part of the purpose of Python is to cover up platform
    differences, to the extent possible (and perhaps sensible -- there is
    the argument). This was part of the purpose of writing our own io module
    instead of using the compiler stdlib. The evolution of floating point
    math has gone in the same direction. For instance, float now expects
    uniform platform-independent Python-dependent names for infinity and nan
    instead of compiler-dependent names.

    As for practicality. Notepad++ on Windows offers ANSI, utf-8 (w,w/o
    BOM), utf-16 (big/little endian). I believe that ODF documents are utf-8
    encoded xml (compressed or not). My original claim for this proposal
    was/is that even Windows apps are moving to utf-8 and that someday
    making that the default for Python everywhere will be the obvious and
    sensible thing.

    --
    Terry Jan Reedy
  • Michael Foord at Jun 28, 2011 at 5:22 pm

    On 28/06/2011 18:06, Terry Reedy wrote:
    On 6/28/2011 10:46 AM, Paul Moore wrote:

    I use Windows, and come from the UK, so 99% of my text files are
    ASCII. So the majority of my code will be unaffected. But in the
    occasional situation where I use a £ sign, I'll get encoding errors,
    I do not understand this. With utf-8 you would never get a string
    encoding error.
    I assumed he meant that files written out as utf-8 by python would then
    be read in using the platform encoding (i.e. not utf-8 on Windows) by
    the other applications he is inter-operating with. The error would not
    be in Python but in those applications.
    where currently things will "just work".
    As long as you only use the machine-dependent restricted character set.
    Which is the situation he is describing. You do go into those details
    below, and which choice is "correct" depends on which trade-off you want
    to make.

    For the sake of backwards compatibility we are probably stuck with the
    current trade-off however - unless we deprecate using open(...) without
    an explicit encoding.

    All the best,

    Michael
    And the failures will be data dependent, and hence intermittent
    (the worst type of problem).
    That is the situation now, with platform/machine dependencies added in.
    Some people share code with other machines, even locally.
    So, in effect, you propose making the default favour writing
    multiplatform portable code at the expense of quick and dirty scripts?
    Let us frame it another way. Should Python installations be compatible
    with other Python installations, or with the other apps on the same
    machine? Part of the purpose of Python is to cover up platform
    differences, to the extent possible (and perhaps sensible -- there is
    the argument). This was part of the purpose of writing our own io
    module instead of using the compiler stdlib. The evolution of floating
    point math has gone in the same direction. For instance, float now
    expects uniform platform-independent Python-dependent names for
    infinity and nan instead of compiler-dependent names.

    As for practicality. Notepad++ on Windows offers ANSI, utf-8 (w,w/o
    BOM), utf-16 (big/little endian). I believe that ODF documents are
    utf-8 encoded xml (compressed or not). My original claim for this
    proposal was/is that even Windows apps are moving to utf-8 and that
    someday making that the default for Python everywhere will be the
    obvious and sensible thing.

    --
    http://www.voidspace.org.uk/

    May you do good and not evil
    May you find forgiveness for yourself and forgive others
    May you share freely, never taking more than you give.
    -- the sqlite blessing http://www.sqlite.org/different.html
  • Paul Moore at Jun 28, 2011 at 8:11 pm

    On 28 June 2011 18:22, Michael Foord wrote:
    On 28/06/2011 18:06, Terry Reedy wrote:
    On 6/28/2011 10:46 AM, Paul Moore wrote:

    I use Windows, and come from the UK, so 99% of my text files are
    ASCII. So the majority of my code will be unaffected. But in the
    occasional situation where I use a £ sign, I'll get encoding errors,
    I do not understand this. With utf-8 you would never get a string encoding
    error.
    I assumed he meant that files written out as utf-8 by python would then be
    read in using the platform encoding (i.e. not utf-8 on Windows) by the other
    applications he is inter-operating with. The error would not be in Python
    but in those applications.
    That is correct. Or files written out (as platform encoding) by other
    applications, will later be read in as UTF-8 by Python, and be seen as
    incorrect characters, or worse raise decoding errors. (Sorry, in my
    original post I said "encoding" where I meant "decoding"...)

    I'm not interested in allocating "blame" for the "error". I'm not
    convinced that it *is* an error, merely 2 programs with incompatible
    assumptions. What I'm saying is that compatibility between various
    programs on a single machine can, in some circumstances, be more
    important than compatibility between (the same, or different) programs
    running on different machines or OSes. And that I, personally, am in
    that situation.
    where currently things will "just work".
    As long as you only use the machine-dependent restricted character set.
    Which is the situation he is describing. You do go into those details below,
    and which choice is "correct" depends on which trade-off you want to make.

    For the sake of backwards compatibility we are probably stuck with the
    current trade-off however - unless we deprecate using open(...) without an
    explicit encoding.
    Backward compatibility is another relevant point. But other than that,
    it's a design trade-off, agreed. All I'm saying is that I see the
    current situation (which is in favour of quick script use and beginner
    friendly at the expense of conceptual correctness and forcing the user
    to think about his choices) as being preferable (and arguably more
    "Pythonic", in the sense that I see it as a case of "practicality
    beats purity" - although it's easy to argue that "in the face of
    ambiguity..." also applies here :-))

    Paul.
  • Antoine Pitrou at Jun 28, 2011 at 5:56 pm

    On Tue, 28 Jun 2011 13:06:44 -0400 Terry Reedy wrote:

    As for practicality. Notepad++ on Windows offers ANSI, utf-8 (w,w/o
    BOM), utf-16 (big/little endian).
    Well, that's *one* application. We would need much more data than that.
    I believe that ODF documents are utf-8
    encoded xml (compressed or not).
    XML doesn't matter for this discussion, since it explicitly declares the
    encoding. What we are talking about is "raw" text files that don't have
    an encoding declaration and for which the data format doesn't specify
    any default encoding (which also rules out Python source code, by the
    way).
    My original claim for this proposal
    was/is that even Windows apps are moving to utf-8
    and that someday
    making that the default for Python everywhere will be the obvious and
    sensible thing.
    True, but that may be 5 or 10 years from now.

    Regards

    Antoine.
  • Georg Brandl at Jun 28, 2011 at 6:18 pm

    On 28.06.2011 19:06, Terry Reedy wrote:
    On 6/28/2011 10:46 AM, Paul Moore wrote:

    I use Windows, and come from the UK, so 99% of my text files are
    ASCII. So the majority of my code will be unaffected. But in the
    occasional situation where I use a £ sign, I'll get encoding errors,
    I do not understand this. With utf-8 you would never get a string
    encoding error.
    Yes, but you'll get plenty of *decoding* errors.

    Georg
  • Victor Stinner at Jun 28, 2011 at 10:00 pm

    I don't think that Windows developers even know that they are writing
    files into the ANSI code page. MSDN documentation of
    WideCharToMultiByte() warns developers that the ANSI code page is not
    portable, even across Windows computers:
    Probably true. But for many uses they also don't care. If you're
    writing something solely for a one-off job on your own PC, the ANSI
    code page is fine, and provides interoperability with other programs
    on your PC, which is really what you care about. (UTF-8 without BOM
    displays incorrectly in Vim, wordpad, and powershell get-content.
    I tried to open a text file encoded as UTF-8 (without BOM) on Windows 7.

    The default application displays it correctly; it's the well-known
    builtin Notepad program.

    gvim is unable to detect the encoding, it reads the file using the ANSI
    code page (WTF? UTF-8 is correctly detected on Linux!?).

    Wordpad reads the file using the ANSI code page, it is unable to detect
    the UTF-8 encoding.

    The "type" command in an MS-DOS shell (cmd.exe) doesn't display the UTF-8
    correctly, but a file encoded to the ANSI code page is also displayed
    incorrectly. I suppose that the problem is that the terminal uses the OEM
    code page, not the ANSI code page.

    Visual C++ 2008 detects the UTF-8 encoding.

    I don't have other applications to test on my Windows 7 machine. I agree
    that UTF-8 is not well supported by "standard" Windows applications. I
    would at least expect that WordPad and gvim are able to detect the UTF-8
    encoding.
    MBCS works fine in all of these. It also displays incorrectly in CMD type,
    but in a less familiar form than the incorrect display mbcs produces,
    for what that's worth...)
    True, the encoding of a text file encoded to the ANSI code page is
    correctly detected by all applications (except "type" in a shell, which
    is probably the OEM/ANSI code page conflict).
    IMHO, you missed another option - open() does not need improving, the
    current behaviour is better than any of the 3 options noted.
    My original need is to detect that my program will behave differently on
    Linux and Windows, because open() uses the implicit locale encoding.
    Antoine suggested that I monkeypatch __builtins__.open to do that.

    Victor
  • Baptiste Carvello at Jun 29, 2011 at 7:21 am

    On 28/06/2011 16:46, Paul Moore wrote:
    -1. This will make things harder for simple scripts which are not
    intended to be cross-platform.
    +1 to all you said.

    I frequently use the python command prompt or "python -c" for various quick
    tasks (mostly on linux). I would hate to replace my ugly, but working
    open('example.txt').read()
    with the unnecessarily verbose
    open('example.txt',encoding='utf-8').read()
    When using Python that way as a "swiss army knife", typing does matter.


    My preferred solution would be:
    - emit a warning if the encoding argument is not set
    By the way, I just thought that for real programming, I would love to have
    a -Wcrossplatform command line switch, which would warn about all
    unportable constructs in one go. That way, I don't have to remember which
    parts of 'os' wrap posix-only functionality.

    Baptiste
  • Victor Stinner at Jun 29, 2011 at 10:01 am

    On Wednesday, June 29, 2011 at 09:21 +0200, Baptiste Carvello wrote:
    By the way, I just thought that for real programming, I would love to have a
    -Wcrossplatform command switch, which would warn for all unportable constructs
    in one go. That way, I don't have to remember which parts of 'os' wrap
    posix-only functionality.
    When I developed using PHP, error_reporting(E_ALL) was really useful. I
    would like the same thing in Python, but I realized that it is not
    necessary. Python is already strict *by default*.

    Python can help developers by warning them about some "corner cases". We
    already have the -bb option for bytes/str warnings (in Python 3),
    -Werror to convert warnings to exceptions, and ResourceWarning (since
    Python 3.2) for unclosed files/sockets. I "just" would like a new
    warning for an implicit locale encoding, so -Wcrossplatform would be as
    easy as -Werror. -Werror is like Perl's "use strict;" or PHP's
    error_reporting(E_ALL). Use -Wd if you prefer to display warnings
    instead of raising exceptions.

    See issues #11455 and #11470 for a new "CompatibilityWarning"; it's not
    "cross platform" but "cross Python" :-) It warns about implementation
    details like non-string keys in a type dict.

    Victor
  • Stefan Behnel at Jun 29, 2011 at 6:28 am

    Victor Stinner, 28.06.2011 15:43:
    In Python 2, open() opens the file in binary mode (e.g. file.readline()
    returns a byte string). codecs.open() opens the file in binary mode by
    default, you have to specify an encoding name to open it in text mode.

    In Python 3, open() opens the file in text mode by default. (It only
    opens the binary mode if the file mode contains "b".) The problem is
    that open() uses the locale encoding if the encoding is not specified,
    which is the case *by default*. The locale encoding can be:

    - UTF-8 on Mac OS X, most Linux distributions
    - ISO-8859-1 on some FreeBSD systems
    - ANSI code page on Windows, e.g. cp1252 (close to ISO-8859-1) in
    Western Europe, cp932 in Japan, ...
    - ASCII if the locale is manually set to an empty string or to "C", or
    if the environment is empty, or by default on some systems
    - something different depending on the system and user configuration...

    If you develop under Mac OS X or Linux, you may have surprises when you
    run your program on Windows on the first non-ASCII character. You may
    not detect the problem if you only write text in English... until
    someone writes the first letter with a diacritic.
    I agree that this is a *very* common source of problems. People write code
    that doesn't care about encodings all over the place, and are then
    surprised when it stops working at some point, either by switching
    environments or by changing the data. I've seen this in virtually all
    projects I've ever come to work in[1]. So, eventually, all of that code was
    either thrown away or got debugged and fixed to use an explicit (and
    usually configurable) encoding.

    Consequently, I don't think it's a bad idea to break out of this ever
    recurring development cycle by either requiring an explicit encoding right
    from the start, or by making the default encoding platform independent. The
    opportunity to fix this was very unfortunately missed in Python 3.0.

    Personally, I don't buy the argument that it's harder to write quick
    scripts if an explicit encoding is required. Most code that gets written is
    not just quick scripts, and even those tend to live longer than initially
    intended.

    Stefan


    [1] Admittedly, most of those projects were in Java, where the situation is
    substantially worse than in Python. Java entirely lacks a way to define a
    per-module source encoding, and it even lacks a straightforward way to
    encode/decode a file with an explicit encoding. So, by default, *both*
    input encodings are platform dependent, whereas in Python it's only the
    default file encoding, and properly decoding a file is straightforward
    there.
