FAQ
If i have a nested list, where the atoms are unicode strings, e.g.

# -*- coding: utf-8 -*-
ttt=[[u"?",u"?"], [u"???"],...]
print ttt

how can i print it without getting the u'\u1234' notation?
i.e. i want it print just like this: [[u"?"], ...]

I can of course write a loop then for each string use
"encode("utf-8")", but is there a easier way?

Thx.

Xah
xah at xahlee.org
? http://xahlee.org/

  • Carsten Haese at Sep 10, 2007 at 3:12 pm

    On Mon, 2007-09-10 at 06:59 -0700, Xah Lee wrote:
    If i have a nested list, where the atoms are unicode strings, e.g.

    # -*- coding: utf-8 -*-
    ttt=[[u"?",u"?"], [u"???"],...]
    print ttt

    how can i print it without getting the u'\u1234' notation?
    i.e. i want it print just like this: [[u"?"], ...]

    I can of course write a loop then for each string use
    "encode("utf-8")", but is there a easier way?
    It's not quite clear why you want to do this, but this is how you could
    do it:

    print repr(ttt).decode("unicode_escape").encode("utf-8")

    However, I am getting the impression that this is a "How can I use 'X'
    to achieve 'Y'?" question instead of the preferable "How can I achieve
    'Y'?" type of question. In other words, printing the repr() of a list
    might not be the best solution to reach the actual goal, which you have
    not stated.

    HTH,
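    Carsten's one-liner is Python 2. A minimal sketch of the same round-trip
    in modern Python 3, where ascii() plays the role Python 2's repr() plays
    here (the variable names and sample strings are illustrative, not from
    the thread):

```python
# Python 3 sketch of the round-trip Carsten describes for Python 2:
# ascii() produces pure-ASCII \uXXXX escapes, and decoding those bytes
# with "unicode_escape" turns the escapes back into real characters.
ttt = [["\u4e2d", "\u6587"], ["\u65e5\u672c\u8a9e"]]

escaped = ascii(ttt)          # "[['\\u4e2d', '\\u6587'], ...]"
readable = escaped.encode("ascii").decode("unicode_escape")

print(readable)               # [['中', '文'], ['日本語']]
```

    One caveat of this display trick (in either Python version): a string
    that itself contains a literal backslash escape, e.g. the six
    characters \u0041, will be mis-rendered, because unicode_escape cannot
    tell it apart from a real escape produced by repr()/ascii().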
  • Xah Lee at Sep 10, 2007 at 10:14 pm
    On Sep 10, 8:12 am, Carsten Haese wrote:
    Xah Lee wrote:
    If i have a nested list, where the atoms are unicode strings, e.g.

    # -*- coding: utf-8 -*-
    ttt=[[u"?",u"?"], [u"???"],...]
    print ttt

    how can i print it without getting the u'\u1234' notation?
    i.e. i want it print just like this: [[u"?"], ...]


    Carsten Haese wrote:

    It's not quite clear why you want to do this, but this is how you
    could
    do it:

    print repr(ttt).decode("unicode_escape").encode("utf-8")


    Super! Thanks a lot.

    About why i want to... i think it's just simpler and easier on the
    eye?

    here's an example output from my program:
    [[u' ', 1022], [u'?', 472], [u' ', 128], [u'?w', 300], [u'?s', 12],
    [u'?|', 184],...]

    wouldn't it be preferable if Python printed like this by default...

    Xah
    xah at xahlee.org
    ? http://xahlee.org/
  • Xah Lee at Sep 10, 2007 at 10:39 pm
    Google groups seems to be stripping my quotation marks lately.
    Here's a retry to post my previous message.

    --------------------------------------------------------------

    Xah Lee wrote:

    If i have a nested list, where the atoms are unicode strings, e.g.
    # -*- coding: utf-8 -*-
    ttt=[[u"?",u"?"], [u"???"],...]
    print ttt

    how can i print it without getting the u'\u1234' notation?
    i.e. i want it print just like this: [[u"?"], ...]


    Carsten Haese wrote:

    It's not quite clear why you want to do this, but this is how you
    could do it:

    print repr(ttt).decode("unicode_escape").encode("utf-8")


    Super! Thanks a lot.

    About why i want to... i think it's just simpler and easier on the
    eye?

    here's an example output from my program:
    [[u' ', 1022], [u'?', 472], [u' ', 128], [u'?w', 300], [u'?s', 12],
    [u'?|', 184],...]

    wouldn't it be preferable if Python printed like this by default...

    Xah
    x... at xahlee.org
    ? http://xahlee.org/
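    As it happens, the default behavior Xah asks for is exactly what later
    Python versions adopted: in Python 3 (which postdates this 2007 thread),
    repr() of containers shows non-ASCII characters directly, and the old
    escaped form is still available via ascii(). A minimal sketch with
    illustrative data:

```python
# Python 3: print/repr of containers shows the characters themselves;
# ascii() reproduces the escaped Python 2-style display.
data = [["中", "文"], ["日本語"]]

print(data)         # [['中', '文'], ['日本語']]
print(ascii(data))  # [['\u4e2d', '\u6587'], ['\u65e5\u672c\u8a9e']]
```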
  • Xah Lee at Sep 11, 2007 at 2:26 am
    This post is about some notes on and corrections to an online article
    regarding unicode and python.

    --------------

    by happenstance i was reading:

    Unicode HOWTO
    http://www.amk.ca/python/howto/unicode

    Here's some problems i see:

    ? No conspicuous authorship. (However, oddly, it has a conspicuous
    acknowledgement listing of names.) (This problem is an indirect
    consequence of the communist fanaticism ushered in by the OpenSource
    movement.) (Originally i was just going to write to the author with
    some corrections.)

    ? It's very wasteful of space. In most texts, the majority of the
    code points are less than 127, or less than 255, so a lot of space is
    occupied by zero bytes.

    Not true. In Asia, most chars have unicode numbers above 255.
    Considered globally, *possibly* today there are more computer files in
    Chinese than in all latin-alphabet based langs.

    ? Many Internet standards are defined in terms of textual data, and
    can't handle content with embedded zero bytes.

    Not sure what he means by "can't handle content with embedded zero
    bytes". Overall i think this sentence is silly, and he's probably
    thinking of unix/linux.

    ? Encodings don't have to handle every possible Unicode
    character, ....

    This is inane. An encoding, by definition, turns numbers into binary
    numbers (in our context, it means an encoding handles all unicode chars
    by definition). What he really meant to say is something like this:
    "Practically speaking, most computer languages in western society
    don't need to support unicode with respect to the language's source
    file"

    ?
    UTF-8 has several convenient properties:
    1. It can handle any Unicode code point.
    ...


    As mentioned before, by definition, any Unicode encoding encodes the
    whole unicode char set. Mentioning the above as a "convenient
    property" is inane.

    ? 4.UTF-8 is fairly compact; the majority of code points are turned
    into two bytes, and values less than 128 occupy only a single byte.

    Note here that utf-8 is relatively compact only if most of your text
    is in latin alphabets. If you are not an occidental man and you write
    Chinese, utf-8 is comparatively inefficient. (utf-8, as one of the
    Unicode encodings, is probably comparatively inefficient for japanese,
    korean, Arabic, or any non-latin-alphabet based langs)

    Also note, the article focuses overly on utf-8. Microsoft's Windows
    NT is probably the first major operating system that supported
    unicode thoroughly, and it uses utf-16. For much of America and
    Europe, which are currently roughly the leaders in computing, utf-8
    is more efficient in some sense (e.g. at least in disk space
    requirements). But considering global computing, in particular
    Chinese & Japanese, utf-16 is overall superior to utf-8.

    Part of the reason utf-8 is favored in this article has to do with
    Linux (and unix). The reason unixes in general have chosen utf-8
    instead of utf-16 is largely because unix is one motherfucking bag of
    shit such that it is impossible to support utf-16 without scrapping a
    large chunk of unix things.

    PS I did not read the article in detail, but only roughly, to see how
    Python handles unicode, because i was often confused by python's
    encode/decode/unicode methods and functions.

    ... am gonna continue reading that article about Python specific
    issues...

    also note, this post is posted thru groups.google.com, and it contains
    the double angled quotation mark chars. As of 2 weeks ago, these
    quotation marks seem to be deleted in the process of posting, i.e.
    unicode name: "LEFT-POINTING DOUBLE ANGLE QUOTATION MARK" and "RIGHT-
    POINTING DOUBLE ANGLE QUOTATION MARK". Here, i enclose the double-
    angled quotation marks inside double curly quotes: " ". If inside the
    double curly quotes you see spaces, then that means google groups
    fucked up.

    References and Further readings:

    ? Unicode in Perl & Python
    http://xahlee.org/perl-python/unicode.html

    ? the Journey of a Foreign Character thru Internet
    http://xahlee.org/Periodic_dosage_dir/t2/non-ascii_journey.html

    ? Unicode Characters Example
    http://xahlee.org/Periodic_dosage_dir/t1/20040505_unicode.html

    ? Python's unicodedata module
    http://xahlee.org/perl-python/unicodedata_module.html

    ? Emacs and Unicode Tips
    http://xahlee.org/emacs/emacs_n_unicode.html

    ? Java Tutorial: Unicode in Java
    http://xahlee.org/java-a-day/unicode_in_java.html

    ? Character Sets and Encoding in HTML
    http://xahlee.org/js/html_chars.html

    Xah
    xah at xahlee.org
    ? http://xahlee.org/
  • J. Cliff Dyer at Sep 11, 2007 at 2:49 am

    Xah Lee wrote:
    This post is about some notes and corrections to a online article
    regarding unicod and python.

    --------------

    by happenstance i was reading:

    Unicode HOWTO
    http://www.amk.ca/python/howto/unicode

    Here's some problems i see:

    ? No conspicuous authorship. (however, oddly, it has a conspicuous
    acknowledgement of names listing.) (This problem is a indirect
    consequence of communism fanatism ushered by OpenSource movement)
    (Originally i was just going to write to the author on some
    corrections.)

    ? It's very wasteful of space. In most texts, the majority of the
    code points are less than 127, or less than 255, so a lot of space is
    occupied by zero bytes.

    Not true. In Asia, most chars has unicode number above 255. Considered
    globally, *possibly* today there are more computer files in Chinese
    than in all latin-alphabet based lang.
    That's an interesting point. I'd be interested to see numbers on
    that, and how those numbers have changed over the past five years.
    Sadly, such data is most likely impossible to obtain.

    However, it should be pointed out that most *code*, whether written in
    the United States, New Zealand, India, China, or Botswana, is written
    in English. In part this is because English has become a standard of
    sorts, much as Italian was a standard for musical notation, due in
    part to the US's former (and perhaps current, but certainly fading)
    dominance in the field, and in part to the lack of solid support for
    unicode among many programming languages and compilers. Thus the
    author's bias, while inaccurate, is still understandable.
    ? Many Internet standards are defined in terms of textual data, and
    can't handle content with embedded zero bytes.

    Not sure what he mean by "can't handle content with embedded zero
    bytes". Overall i think this sentence is silly, and he's probably
    thinking in unix/linux.

    ? Encodings don't have to handle every possible Unicode
    character, ....

    This is inane. A encoding, by definition, turns numbers into binary
    numbers (in our context, it means a encoding handles all unicode chars
    by definition). What he really meant to say is something like this:
    "Practically speaking, most computer languages in western society
    don't need to support unicode with respect to the language's source
    file"

    ?
    UTF-8 has several convenient properties:
    1. It can handle any Unicode code point.
    ...


    As mentioned before, by definition, any Unicode encoding encodes all
    unicode char set. The mentioning of above as a "convenient property"
    is inane.
    No, it's not inane. UCS-2, for example, is a fixed width, 2-byte
    encoding that can handle any unicode code point up to 0xffff, but
    cannot handle the 3 and 4 byte extension sets. UCS-2 was developed
    for applications in which having fixed width characters is essential,
    but has the limitation of not being able to handle every Unicode code
    point. IIRC, when it was developed, it did handle every code point,
    and then Unicode grew. There is also a UCS-4 to handle this
    limitation. UTF-16 is based on a two-byte unit, but is variable
    width, like UTF-8, which makes it flexible enough to handle any code
    point, but harder to process, and a bear to seek through to a certain
    point.

    (I'm politely ignoring your ill-reasoned attacks on non-Microsoft OSes).

    Cheers,
    Cliff
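    Cliff's distinction between UCS-2's BMP limit and UTF-16's surrogate
    mechanism can be shown concretely (a Python 3 sketch, so postdating
    this thread; U+20000 is a CJK Extension B ideograph outside the BMP):

```python
import struct

# A BMP character occupies one 16-bit unit in UTF-16;
# a character above U+FFFF needs a surrogate pair (two units).
bmp = "\u4e2d"        # U+4E2D, inside the BMP
supp = "\U00020000"   # U+20000, outside the BMP

assert len(bmp.encode("utf-16-le")) == 2    # one 16-bit code unit
assert len(supp.encode("utf-16-le")) == 4   # surrogate pair: two units

# The pair itself: high surrogate 0xD840, low surrogate 0xDC00.
hi, lo = struct.unpack("<2H", supp.encode("utf-16-le"))
print(hex(hi), hex(lo))   # 0xd840 0xdc00
```

    This is also why seeking to the Nth character in UTF-16 text is not a
    constant-time offset calculation, as Cliff notes.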
  • Marc 'BlackJack' Rintsch at Sep 11, 2007 at 6:18 am

    On Mon, 10 Sep 2007 19:26:20 -0700, Xah Lee wrote:

    ? Many Internet standards are defined in terms of textual data, and
    can't handle content with embedded zero bytes.

    Not sure what he mean by "can't handle content with embedded zero
    bytes". Overall i think this sentence is silly, and he's probably
    thinking in unix/linux.
    No, he's probably thinking of all the text based protocols (HTTP,
    SMTP, ...) and of the fact that one of the most used programming
    languages, C, can't cope with embedded null bytes in strings.
    ? Encodings don't have to handle every possible Unicode
    character, ....

    This is inane. A encoding, by definition, turns numbers into binary
    numbers (in our context, it means a encoding handles all unicode chars
    by definition).
    How do you encode chinese characters with the ISO-8859-1 encoding? This
    encoding obviously doesn't handle *all* unicode characters.
    ?
    UTF-8 has several convenient properties:
    1. It can handle any Unicode code point.
    ...


    As mentioned before, by definition, any Unicode encoding encodes all
    unicode char set. The mentioning of above as a "convenient property"
    is inane.
    You are being silly here.

    Ciao,
    Marc 'BlackJack' Rintsch
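    Marc's ISO-8859-1 point is easy to verify (a Python 3 sketch; the
    particular characters are illustrative):

```python
# ISO-8859-1 maps exactly the code points 0-255; anything above that
# raises UnicodeEncodeError rather than being "handled by definition".
assert "é".encode("iso-8859-1") == b"\xe9"   # U+00E9 fits

try:
    "\u4e2d".encode("iso-8859-1")            # U+4E2D (a CJK char) does not
except UnicodeEncodeError:
    chinese_fits = False
else:
    chinese_fits = True

print(chinese_fits)   # False
```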
  • Xah Lee at Sep 11, 2007 at 10:46 am
    J. Cliff Dyer wrote:
    " ...UCS-2, for example, is a fixed width, 2-byte encoding that can
    handle any unicode code point up to 0xffff, but cannot handle the 3
    and 4 byte extension sets. "

    I was going to reply to say that this is a good point. But on my way i
    looked up wikipedia,
    http://en.wikipedia.org/wiki/UTF-16/UCS-2

    quote:
    " In computing, UTF-16 (16-bit Unicode Transformation Format) is a
    variable-length character encoding for Unicode, capable of encoding
    the entire Unicode repertoire. "

    and
    " UCS-2 (2-byte Universal Character Set) is an obsolete character
    encoding which is a predecessor to UTF-16. The UCS-2 encoding form is
    nearly identical to that of UTF-16, except that it does not support
    surrogate pairs and therefore can only encode characters in the BMP
    range U+0000 through U+FFFF. "

    So, the matter isn't simple. (i.e. it is not decisive to say that i'm
    incorrect in my original criticism of that article's statement on
    utf-8.)

    ------------

    Btw, i think i should mention that i read the unicode 3 specification
    from cover to cover in 2002. (one heavy, thick, large, deep blue
    colored book)

    Another resource that contributed to my understanding of unicode is
    the book "CJKV Information Processing" by Ken Lunde, which i read in
    the same year.

    Also of interest is that i learned about a year ago that the chinese
    encoding
    http://en.wikipedia.org/wiki/GB_18030
    which all computers sold in China are required by law to support,
    is actually a Unicode encoding. Specifically, it encompasses all the
    chars in Unicode.
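    The claim that GB 18030 covers the whole of Unicode can be checked with
    the gb18030 codec that ships with Python (a Python 3 sketch; the sample
    characters are illustrative):

```python
# gb18030 is a full Unicode transformation format: every code point has
# an encoding, including supplementary-plane characters outside the BMP.
samples = ["中", "é", "\U00020000"]   # BMP CJK, accented Latin, CJK Ext B

for ch in samples:
    assert ch.encode("gb18030").decode("gb18030") == ch   # round-trips

# ASCII stays single-byte; common CJK chars take 2 bytes; rarer
# characters fall back to 4-byte sequences.
assert len("a".encode("gb18030")) == 1
assert len("中".encode("gb18030")) == 2
```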

    Also relevant to our discussion is that recently i was looking at
    alexa.com's web ranking:

    http://alexa.com/site/ds/top_sites?ts_mode=global&lang=none

    and noticed several pure chinese lang websites are among the top 100.

    Baidu.com (??) is at top 8 today, followed by
    ??? (http://www.qq.com) at 12, and
    ?? sina.com.cn at 19, etc.

    It is somewhat amazing in the context of computing and languages. No
    other non-English lang comes close.

    (Note here also that Chinese, as measured by number of speakers, has
    roughly 4 times the speakers of English.
    http://en.wikipedia.org/wiki/Ethnologue_list_of_most_spoken_languages
    This fact, coupled with the development and commercialization of
    China in the past decade, is the reason for the above web ranking
    result.)

    Not relevant to our discussion, but I happened to also notice a site
    named youporn.com (it was ranked 69 a few weeks ago). youporn.com is
    basically like youtube.com, but with porn vids. It has long been my
    thought that the progress of humanity in a society can be measured
    by its popularity and acceptance of porn. (in fact i recall seeing
    some academic (or not) report about this a few months ago... couldn't
    remember where now) Society as a whole has improved dramatically
    since the communication revolution, in particular the part that
    started with the web.

    (see Xah's Porn Outspeak
    http://xahlee.org/PageTwo_dir/Personal_dir/porn_movies.html

    For more info about youporn.com, see:

    http://en.wikipedia.org/wiki/Youporn

    curious parties might also check out

    http://en.wikipedia.org/wiki/Youtube

    which is a major phenomenon that, in my opinion, has contributed to
    the progress of humanity far more than, say, any university or
    educational institution.

    (my thesis in general in this direction is that communication, the
    main medium of knowledge, is the utmost factor in the human animal's
    progress with respect to what's generally considered humanitarianism.
    More important than, say, the need to decry war, have laws, maintain
    peace, spread gospels, aid the poor, ... etc. (and in fact, in this
    thesis, i consider what are commonly considered good activities, such
    as aiding the poor, or any moral attitude and activity about the good
    of humanity (such as OpenSource), to be in fact criminal in their
    effects and almost in their intention too ...)) )

    PS for some reason, messages posted thru the google groups service in
    the past week or so have the unicode double angle bracket chars
    (U+00AB and U+00BB) stripped off. For that reason, in this msg i've
    also used double curly quotes "" wherever i have double angle
    brackets.

    Xah
    xah at xahlee.org
    ? http://xahlee.org/
  • Sion Arrowsmith at Sep 11, 2007 at 4:46 pm

    Xah Lee wrote:
    " It's very wasteful of space. In most texts, the majority of the
    code points are less than 127, or less than 255, so a lot of space is
    occupied by zero bytes. "

    Not true. In Asia, most chars has unicode number above 255. Considered
    globally, *possibly* today there are more computer files in Chinese
    than in all latin-alphabet based lang.
    This doesn't hold water. There are many good reasons for preferring
    UTF16 over UTF8, but unless you know you're only ever going to be
    handling scripts from Unicode blocks above Arabic, it's reasonable
    to assume that UTF8 will be at least as compact. Consider that
    transcoding a Chinese file from UTF16 to UTF8 will probably increase
    its size by 50% (the CJK ideograph blocks encode to 3 bytes). While
    transcoding a document in a Western European langauge the other way
    can be expected to increase its size by up to 100% (every single-
    byte character is doubled). You'd have to be talking about double to
    volume of CJK data before switching from UTF8 to UTF16 becomes even
    a break-even proposition space-wise.

    (It's curious to note that the average word length in English is
    often taken to be 6 letters. Similarly, in UTF8-encoded Chinese the
    average word length is 6 bytes....)
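    Sion's byte arithmetic can be checked directly (a Python 3 sketch with
    illustrative sample strings):

```python
# Compare encoded sizes per character class (lengths in bytes).
chinese = "中文字"             # three CJK ideographs
english = "hello"             # five ASCII letters

# CJK: 3 bytes/char in UTF-8 vs 2 in UTF-16 -> UTF-16 to UTF-8 grows ~50%.
assert len(chinese.encode("utf-8")) == 9
assert len(chinese.encode("utf-16-le")) == 6

# ASCII: 1 byte/char in UTF-8 vs 2 in UTF-16 -> UTF-8 to UTF-16 doubles.
assert len(english.encode("utf-8")) == 5
assert len(english.encode("utf-16-le")) == 10
```

    ("utf-16-le" is used to exclude the 2-byte byte-order mark that plain
    "utf-16" prepends, so the per-character arithmetic is exact.)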

    --
    \S -- siona at chiark.greenend.org.uk -- http://www.chaos.org.uk/~sion/
    "Frankly I have no feelings towards penguins one way or the other"
    -- Arthur C. Clarke
    her nu become? se bera eadward ofdun hl?ddre heafdes b?ce bump bump bump

Discussion Overview
group: python-list
category: python
posted: Sep 10, '07 at 1:59p
active: Sep 11, '07 at 4:46p
posts: 9
users: 5
website: python.org
