FAQ
I hope my understanding is correct and I'm not dreaming.

When the endianness is not specified (the unmarked forms, as
opposed to the BE and LE forms), the Unicode Consortium specifies
that the default byte serialization should be big-endian.

See http://www.unicode.org/faq//utf_bom.html
Q: Which of the UTFs do I need to support?
and
Q: Why do some of the UTFs have a BE or LE in their label,
such as UTF-16LE?

(+ technical papers)

It appears Python works the opposite way.
>>> import sys
>>> sys.version
'2.7 (r27:82525, Jul 4 2010, 09:01:59) [MSC v.1500 32 bit (Intel)]'
>>> repr(u'abc'.encode('utf-16-le'))
"'a\\x00b\\x00c\\x00'"
>>> repr(u'abc'.encode('utf-16-be'))
"'\\x00a\\x00b\\x00c'"
>>> repr(u'abc'.encode('utf-16'))
"'\\xff\\xfea\\x00b\\x00c\\x00'"
>>> repr(u'abc'.encode('utf-16')[2:]) == repr(u'abc'.encode('utf-16-be'))
False
>>> repr(u'abc'.encode('utf-16')[2:]) == repr(u'abc'.encode('utf-16-le'))
True

The same holds for utf-32, and for utf-16/utf-32 in Python 3.1.2.

I tried to find a precise discussion of this subject
and failed.

Any thoughts?


  • Antoine Pitrou at Oct 12, 2010 at 1:47 pm

    On Tue, 12 Oct 2010 06:28:23 -0700 (PDT) jmfauth wrote:

    I hope my understanding is correct and I'm not dreaming.

    When the endianness is not specified (the unmarked forms, as
    opposed to the BE and LE forms), the Unicode Consortium specifies
    that the default byte serialization should be big-endian. [...]
    It appears Python is just working in the opposite way.
    [...]
    repr(u'abc'.encode('utf-16')[2:]) == repr(u'abc'.encode('utf-16-le'))
    True

    Python uses the host's endianness by default. So, on a little-endian
    machine, utf-16 and utf-32 will use little-endian encoding.
    While decoding, though, the BOM is read by both of these codecs, so
    there should be no interoperability problems:

    >>> '\xff\xfea\x00b\x00c\x00'.decode('utf-16')
    u'abc'
    >>> '\xfe\xff\x00a\x00b\x00c'.decode('utf-16')
    u'abc'


    (do note, though, that the explicit utf*-be and utf*-le variants do not
    add a BOM)

    Regards

    Antoine.
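
A minimal sketch of the behaviour Antoine describes, written in Python 3 syntax (an assumption; the session in the thread is Python 2.7). The variable names are illustrative only. It checks that the plain "utf-16" codec writes a BOM followed by data in the host's byte order, that decoding honours either BOM, and that the explicit variants add no BOM.

```python
import codecs
import sys

text = "abc"

# The plain "utf-16" codec writes a BOM, then the data in the host's byte order.
encoded = text.encode("utf-16")
native_codec = "utf-16-le" if sys.byteorder == "little" else "utf-16-be"

bom, payload = encoded[:2], encoded[2:]
bom_matches_host = bom == codecs.BOM_UTF16  # BOM_UTF16 is the native-order BOM
payload_matches_host = payload == text.encode(native_codec)

# When decoding, "utf-16" reads the BOM, so either byte order round-trips.
from_le = b"\xff\xfea\x00b\x00c\x00".decode("utf-16")
from_be = b"\xfe\xff\x00a\x00b\x00c".decode("utf-16")

# The explicit variants never add a BOM.
no_bom_le = text.encode("utf-16-le")
no_bom_be = text.encode("utf-16-be")

print(sys.byteorder, bom_matches_host, payload_matches_host, from_le, from_be)
```

On a little-endian host the payload equals the "utf-16-le" output, which matches the result reported in the original post.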
  • Jmfauth at Oct 12, 2010 at 2:49 pm

    On 12 oct, 15:47, Antoine Pitrou wrote:
    On Tue, 12 Oct 2010 06:28:23 -0700 (PDT)

    Python uses the host's endianness by default. So, on a little-endian
    machine, utf-16 and utf-32 will use little-endian encoding.

    Thanks. I was never aware of this.
  • John Machin at Oct 12, 2010 at 8:00 pm

    jmfauth <wxjmfauth <at> gmail.com> writes:

    When the endianness is not specified (the unmarked forms, as
    opposed to the BE and LE forms), the Unicode Consortium specifies
    that the default byte serialization should be big-endian.

    See http://www.unicode.org/faq//utf_bom.html
    Q: Which of the UTFs do I need to support?
    and
    Q: Why do some of the UTFs have a BE or LE in their label,
    such as UTF-16LE?

    Sometimes it is necessary to read right to the end of an answer:

    Q: Why do some of the UTFs have a BE or LE in their label, such as UTF-16LE?

    A: [snip] the unmarked form uses big-endian byte serialization by default, but
    may include a byte order mark at the beginning to indicate the actual byte
    serialization used.
  • Jmfauth at Oct 13, 2010 at 7:07 am

    On 12 oct, 22:00, John Machin wrote:
    jmfauth <wxjmfauth <at> gmail.com> writes:
    When the endianness is not specified (the unmarked forms, as
    opposed to the BE and LE forms), the Unicode Consortium specifies
    that the default byte serialization should be big-endian.
    See http://www.unicode.org/faq//utf_bom.html
    Q: Which of the UTFs do I need to support?
    and
    Q: Why do some of the UTFs have a BE or LE in their label,
    such as UTF-16LE?
    Sometimes it is necessary to read right to the end of an answer:

    Q: Why do some of the UTFs have a BE or LE in their label, such as UTF-16LE?

    A: [snip] the unmarked form uses big-endian byte serialization by default, but
    may include a byte order mark at the beginning to indicate the actual byte
    serialization used.


    Well, English is not my native language; however, I think I read it
    correctly.

    My question had nothing to do with the BOM, with encoding/decoding,
    or with whether a BOM is included. My question was:

    What should I understand by "utf-16": "utf-16-le" or "utf-16-be"?

    And Antoine gave an answer.
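
If output that does not depend on the host is wanted, one option is to choose the byte order explicitly and prepend the BOM by hand. The sketch below assumes Python 3, and `encode_utf16_be_with_bom` is a hypothetical helper name, not part of the standard library. It produces big-endian data, the order the Unicode FAQ names as the default for the unmarked form, while still carrying a BOM for decoders that look for one.

```python
import codecs


def encode_utf16_be_with_bom(text):
    """Encode text as big-endian UTF-16 with a leading BOM (hypothetical helper)."""
    # "utf-16-be" adds no BOM of its own, so prepend it explicitly.
    return codecs.BOM_UTF16_BE + text.encode("utf-16-be")


data = encode_utf16_be_with_bom("abc")

# The generic "utf-16" decoder consumes the BOM and picks the byte order from it.
decoded_generic = data.decode("utf-16")

# The explicit "utf-16-be" decoder treats the BOM as ordinary data (U+FEFF).
decoded_explicit = data.decode("utf-16-be")

print(data, decoded_generic, repr(decoded_explicit))
```

The bytes produced this way are the same on little-endian and big-endian machines, unlike the output of the plain "utf-16" codec.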

Discussion Overview
group: python-list
categories: python
posted: Oct 12, '10 at 1:28p
active: Oct 13, '10 at 7:07a
posts: 5
users: 3
website: python.org
