Hi,

I'm parsing an xml file using elementtree, but it seems to get stuck on certain non-ascii characters (for example: "ê"). I'm using Python 2.4. Here's the relevant code fragment:

# CODE:
for element in doc.getiterator():
    try:
        m = re.match(search_text, str(element.text))
    except UnicodeEncodeError:
        raise # I want to get rid of this exception.

# PRINTBACK:
    m = re.match(search_text, str(element.text))
UnicodeEncodeError: 'ascii' codec can't encode character u'\xea' in position 4: ordinal not in range(128)

How can I get rid of this unicode encode error? I tried:
s = str(element.text)
s.encode("utf-8")
(and then feeding it into the regex)

The xml file is in UTF-8. Somehow I need to tell the program not to use ascii but utf-8, right?

Thanks in advance!

Cheers!!
Albert-Jan

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
In the face of ambiguity, refuse the temptation to guess.
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~




  • Spir at Nov 25, 2009 at 2:12 pm

    Albert-Jan Roskam wrote:

    # CODE:
    for element in doc.getiterator():
        try:
            m = re.match(search_text, str(element.text))
        except UnicodeEncodeError:
            raise # I want to get rid of this exception.

    First, you should split the two actions performed in that single statement, to isolate the source of the error:
    for element in doc.getiterator():
        try:
            source = str(element.text)
        except UnicodeEncodeError:
            raise # I want to get rid of this exception.
        else:
            m = re.match(search_text, source)

    I guess
    source = unicode(element.text, "utf8")
    should do the job if you actually know the elements are utf8 encoded (otherwise try latin1 or, better, get proper information on the origin of your doc files).

    PS: I just discovered python's builtin attribute file.encoding that should give you the proper encoding to pass to unicode(..., encoding).
    PPS: You should in fact decode the whole source before parsing it, no? (meaning parsing a unicode object, not encoded text)
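
On the PPS, a minimal sketch, assuming the standard-library `xml.etree.ElementTree` module (on Python 2.4 the same API came as the separate `elementtree` package) and a made-up one-element document. It suggests the parser decodes the UTF-8 bytes itself, so `element.text` already arrives as Unicode and a further `unicode(..., "utf8")` call should not be needed. `iter()` stands in for the older `getiterator()` used in the thread.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document, given as UTF-8 encoded bytes.
xml_bytes = (b'<?xml version="1.0" encoding="UTF-8"?>'
             b'<root><item>enqu\xc3\xaate</item></root>')

root = ET.fromstring(xml_bytes)

# Collect the text of every element that has any.
texts = [element.text for element in root.iter() if element.text]

# The two bytes \xc3\xaa in the file became the single character u'\xea'.
first = texts[0]
```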

    Denis
    ________________________________

    la vita e estrany

    http://spir.wikidot.com/
  • Kent Johnson at Nov 25, 2009 at 4:55 pm

    On Wed, Nov 25, 2009 at 8:44 AM, Albert-Jan Roskam wrote:

    Hi,

    I'm parsing an xml file using elementtree, but it seems to get stuck on
    certain non-ascii characters (for example: "ê"). I'm using Python 2.4.
    Here's the relevant code fragment:

    # CODE:
    for element in doc.getiterator():
        try:
            m = re.match(search_text, str(element.text))
        except UnicodeEncodeError:
            raise # I want to get rid of this exception.

    # PRINTBACK:
        m = re.match(search_text, str(element.text))
    UnicodeEncodeError: 'ascii' codec can't encode character u'\xea' in position 4: ordinal not in range(128)

    You can't convert element.text to a str because it contains non-ascii
    characters. Why are you converting it? re.match() will accept a unicode
    string as its argument.

    How can I get rid of this unicode encode error? I tried:
    s = str(element.text)
    s.encode("utf-8")
    (and then feeding it into the regex)

    This fails because it is the str() that won't work. To get UTF-8 use
    s = element.text.encode('utf-8')
    but I don't think this is the correct solution.

    The xml file is in UTF-8. Somehow I need to tell the program not to use
    ascii but utf-8, right?

    No, just pass Unicode to re.match().
    Kent
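
Kent's two points can be sketched as follows, using the word from the traceback and a made-up pattern. Encoding to ASCII explicitly raises the same error that `str()` triggers implicitly on Python 2, while `re.match()` works directly on the Unicode string. The values of `text` and `search_text` are illustrative only.

```python
import re

text = u'enqu\xeate'        # what element.text holds: a Unicode string
search_text = u'enqu\xea'   # hypothetical search pattern, also Unicode

# On Python 2, str(text) implicitly encodes with the ASCII codec;
# doing that explicitly reproduces the error from the traceback.
try:
    text.encode('ascii')
    ascii_failed = False
except UnicodeEncodeError:
    ascii_failed = True

# No conversion is needed for matching: re accepts Unicode directly.
m = re.match(search_text, text)

# Encode only when bytes are really needed, e.g. when writing a file.
utf8_bytes = text.encode('utf-8')
```

Encoding at the output boundary only, as in the last line, keeps everything in between working on Unicode.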
  • Albert-Jan Roskam at Nov 26, 2009 at 1:13 pm
    OK, thanks a lot Spir and Kent for your replies. I converted element.text to str because some of the element.text values were integers and these caused TypeErrors later on in the program. I don't have the program here (it's in the office) so I can't tell you the exact details. It's a search-and-replace program where users can enter a search text (or regex pattern) and a replace text. The source file is an xml file. Currently, strings with non-ascii letters still need to be entered in unicode format, e.g. u'enqu\xeate' instead of "enquête". Kinda ugly. I'll try to fix that later. Thanks again!

    Cheers!!

    Albert-Jan
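
One possible way to handle both remaining issues, sketched under assumptions since the actual program isn't shown here: a hypothetical helper `to_text()` that leaves Unicode text alone and converts integers or `None` without going through the ASCII codec, plus decoding of the user's search text on the assumption that it arrives as UTF-8 bytes (a real console may use a different encoding).

```python
import re

try:
    text_type = unicode   # Python 2
except NameError:
    text_type = str       # Python 3, where str is already Unicode


def to_text(value, encoding='utf-8'):
    """Return value as Unicode text without an implicit ASCII encode."""
    if value is None:
        return u''
    if isinstance(value, bytes):
        # Byte strings are decoded with the assumed encoding.
        return value.decode(encoding)
    if isinstance(value, text_type):
        return value
    # Integers and similar objects convert safely.
    return text_type(value)


# User input assumed to arrive as UTF-8 bytes, e.g. typed as "enquête".
raw_pattern = b'enqu\xc3\xaate'
search_text = to_text(raw_pattern)

m = re.match(search_text, to_text(u'enqu\xeate 2009'))
number = to_text(42)
```

With input decoded this way, users would no longer need to type `u'enqu\xeate'` by hand.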




    --- On Wed, 11/25/09, Kent Johnson wrote:

    From: Kent Johnson <kent37 at tds.net>
    Subject: Re: [Tutor] UnicodeEncodeError
    To: "Albert-Jan Roskam" <fomcl at yahoo.com>
    Cc: "tutor at python.org tutor at python.org tutor at python.org" <tutor at python.org>
    Date: Wednesday, November 25, 2009, 5:55 PM


Discussion Overview
group: tutor @
categories: python
posted: Nov 25, '09 at 1:44p
active: Nov 26, '09 at 1:13p
posts: 4
users: 3
website: python.org
