Hallöchen!

I parse an XML file with ElementTree and get the contents with
the .attrib, .text, .get etc. methods of the tree's nodes.
Additionally, I use the "find" and "findtext" methods.

My problem is that if there is only ASCII, these methods return
ordinary strings instead of unicode. So sometimes I get str,
sometimes I get unicode. Can one change this globally so that they
only return unicode?

Tschö,
Torsten.

--
Torsten Bronger, aquisgrana, europa vetus
Jabber ID: torsten.bronger at jabber.rwth-aachen.de

  • Stefan Behnel at Mar 14, 2009 at 9:57 pm

    Torsten Bronger wrote:
    > I parse an XML file with ElementTree and get the contents with
    > the .attrib, .text, .get etc. methods of the tree's nodes.
    > Additionally, I use the "find" and "findtext" methods.
    >
    > My problem is that if there is only ASCII, these methods return
    > ordinary strings instead of unicode. So sometimes I get str,
    > sometimes I get unicode. Can one change this globally so that they
    > only return unicode?

    That's a convenience measure to reduce memory and processing overhead.
    Could you explain why this is a problem for you?

    Stefan
  • Torsten Bronger at Mar 14, 2009 at 10:06 pm
    Hallöchen!

    Stefan Behnel writes:
    > Torsten Bronger wrote:
    >> [...]
    >>
    >> My problem is that if there is only ASCII, these methods return
    >> ordinary strings instead of unicode. So sometimes I get str,
    >> sometimes I get unicode. Can one change this globally so that
    >> they only return unicode?
    >
    > That's a convenience measure to reduce memory and processing
    > overhead.

    But is this really worth the inconsistency of having partly str and
    partly unicode, given that the common origin is unicode XML data?

    > Could you explain why this is a problem for you?

    I feed ElementTree's output to functions in the unicodedata module.
    And they want unicode input. While it's not a big deal to write
    e.g. unicodedata.category(unicode(my_character)), I find this rather
    wasteful.
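
One way to avoid calling unicode() on values that are already unicode is a small coercion helper. This is only a sketch: `ensure_text`, `category_of` and `text_type` are hypothetical names, not part of ElementTree, and it assumes that any byte string ElementTree hands back is pure ASCII, as described in the question.

```python
import unicodedata

# Hypothetical helper names; not part of ElementTree or unicodedata.
text_type = type(u"")  # unicode on Python 2, str on Python 3


def ensure_text(value):
    """Return value as a text (unicode) string, decoding only byte strings."""
    if value is None or isinstance(value, text_type):
        return value  # already text, or a missing .text / attribute
    # Assumption: byte strings returned by ElementTree are ASCII-only.
    return value.decode("ascii")


def category_of(character):
    """Look up the Unicode general category of a single character."""
    return unicodedata.category(ensure_text(character))
```

With this, `category_of(node.text)` should work whether the node's text came back as str or as unicode, provided it is a single character.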

    Tschö,
    Torsten.

    --
    Torsten Bronger, aquisgrana, europa vetus
    Jabber ID: torsten.bronger at jabber.rwth-aachen.de
  • Stefan Behnel at Mar 15, 2009 at 9:48 am

    Torsten Bronger wrote:
    > Hallöchen!

    und zurück!

    > Stefan Behnel writes:
    >> Torsten Bronger wrote:
    >>> [...]
    >>>
    >>> My problem is that if there is only ASCII, these methods return
    >>> ordinary strings instead of unicode. So sometimes I get str,
    >>> sometimes I get unicode. Can one change this globally so that
    >>> they only return unicode?
    >>
    >> That's a convenience measure to reduce memory and processing
    >> overhead.
    >
    > But is this really worth the inconsistency of having partly str and
    > partly unicode, given that the common origin is unicode XML data?

    Yes. It makes no difference in almost all use cases, as long as you assume Py2
    string handling semantics. In Py3, you will always get Unicode strings anyway.

    >> Could you explain why this is a problem for you?
    >
    > I feed ElementTree's output to functions in the unicodedata module.
    > And they want unicode input. While it's not a big deal to write
    > e.g. unicodedata.category(unicode(my_character)), I find this rather
    > wasteful.

    I just looked at the code. It seems that you can use your own
    XMLTreeBuilder subclass and override the "._fixtext()" method like this:

    def _fixtext(self, text):
        return text

    Then pass an instance of that as "parser" when parsing in ElementTree. That
    should do the trick.
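
Put together, the suggestion might look like the sketch below. `UnicodeParser` is a hypothetical class name and the sample data is made up. The code prefers `XMLParser` (available on Python 2.7 and 3.x) and only falls back to the `XMLTreeBuilder` name on older versions; `_fixtext` is a private method, so this relies on implementation details that may differ between releases.

```python
import io
from xml.etree import ElementTree as ET

# Assumption: prefer XMLParser where it exists; fall back to the older
# XMLTreeBuilder name used in this thread (Python 2.5/2.6).
ParserBase = getattr(ET, "XMLParser", None) or ET.XMLTreeBuilder


class UnicodeParser(ParserBase):  # hypothetical class name
    def _fixtext(self, text):
        # Python 2: skip the ASCII-to-str conversion and keep unicode.
        # Python 3: the parser does not appear to call this hook at all;
        # text is always str there, so the override is harmless.
        return text


data = b"<root><a>abc</a><b>\xc3\xa4</b></root>"  # UTF-8 encoded sample
tree = ET.parse(io.BytesIO(data), parser=UnicodeParser())
root = tree.getroot()
```

After parsing this way, `root.findtext("a")` and `root.findtext("b")` should have the same type, even though only the second contains a non-ASCII character.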

    Stefan
  • Torsten Bronger at Mar 15, 2009 at 10:26 am
    Hallöchen!

    Stefan Behnel writes:
    > Torsten Bronger wrote:
    >> Stefan Behnel writes:
    >>> Torsten Bronger wrote:
    >>>> [...]
    >>>>
    >>>> My problem is that if there is only ASCII, these methods return
    >>>> ordinary strings instead of unicode. So sometimes I get str,
    >>>> sometimes I get unicode. Can one change this globally so that
    >>>> they only return unicode?
    >>
    >> [...]
    >
    > I just looked at the code. It seems that you can use your own
    > XMLTreeBuilder subclass and override the "._fixtext()" method
    > like this:
    >
    >     def _fixtext(self, text):
    >         return text

    Great. Thus, the following monkeypatch seems to do the trick:

    from xml.etree import ElementTree
    # FixMe: Must go away with Python 3
    ElementTree.XMLTreeBuilder._fixtext = lambda self, text: text
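
The "must go away with Python 3" note can be made explicit with a version guard, so the same module imports cleanly on both major versions. A minimal sketch, assuming the private `_fixtext` hook only matters on Python 2; `patch_fixtext` is a hypothetical helper name.

```python
import sys
from xml.etree import ElementTree


def patch_fixtext():
    """Apply the _fixtext monkeypatch on Python 2 only; return True if applied."""
    if sys.version_info[0] >= 3:
        return False  # Python 3 already returns str (unicode) everywhere
    # XMLTreeBuilder is the name used in this thread; fall back to XMLParser
    # if that name is absent.
    parser_cls = (getattr(ElementTree, "XMLTreeBuilder", None)
                  or ElementTree.XMLParser)
    parser_cls._fixtext = lambda self, text: text
    return True


patched = patch_fixtext()
root = ElementTree.fromstring(b"<root name='x'>abc</root>")
```

Like the original monkeypatch, this changes behaviour for every user of ElementTree in the process, so the subclass approach may be preferable in library code.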

    Thank you!

    Tschö,
    Torsten.

    --
    Torsten Bronger, aquisgrana, europa vetus
    Jabber ID: torsten.bronger at jabber.rwth-aachen.de

Discussion Overview
group: python-list
category: python
posted: Mar 14, '09 at 4:32p
active: Mar 15, '09 at 10:26a
posts: 5
users: 2
website: python.org

2 users in discussion

Torsten Bronger: 3 posts Stefan Behnel: 2 posts