Hi,

I only know a little bit of XML and I'm trying to parse an XML document
in order to save its elements in a file (dictionaries inside a list).

When I access a URL from Python 2.3.3 running on Linux with the
following lines:
resposta = urllib.urlopen(url)
xmldoc = minidom.parse(resposta)
resposta.close()

I get the following result:

<?xml version="1.0" encoding="utf-8"?>
<string xmlns="http://www......">&lt;DataSet&gt;
~ &lt;Order&gt;
~ &lt;Customer&gt;439&lt;/Customer&gt;
(... others ...)
~ &lt;/Order&gt;
&lt;/DataSet&gt;</string>
_____________________________________________________________

In the lines below, I try to get all the child nodes from string, first
by counting them, and then ignoring the \n ones:

stringNode = xmldoc.childNodes[0]
print stringNode.toxml()
dataSetNode = stringNode.childNodes[0]
numNos = len(dataSetNode.childNodes)
todosNos={}
for no in range(numNos):
    todosNos[no] = dataSetNode.childNodes[no].toxml()
posicaoXml = [no for no in todosNos.keys() if len(todosNos[no])>4]
print posicaoXml

(I'm almost sure there's a simpler way to do this...)
_____________________________________________________________

I don't get any elements. But, if I access the same url via a browser,
the result in the browser window is something like:

<string xmlns="http://www......">
~ <DataSet>
~ <Order>
~ <Customer>439</Customer>
(... others ...)
~ </Order>
~ </DataSet>
</string>

and the lines I posted work as intended.

I already browsed the web, I know it's about the escape characters, but
I didn't find a simple solution for this.

I tried to use LL2XML.py and an unescape function with a simple replace
text = text.replace("&lt;", "<")
but I had to convert the XML document to a string, and then I could not
work out how to convert it back to an XML object.

How can I solve this? Please explain it bearing in mind that I'm just
beginning with XML and I'm not very experienced in Python either.


Luis


  • Martin v. Löwis at Jan 19, 2005 at 10:07 pm

    Luis P. Mendes wrote:
    I get the following result:

    <?xml version="1.0" encoding="utf-8"?>
    <string xmlns="http://www......">&lt;DataSet&gt;
    ~ &lt;Order&gt;
    Most likely, this result is correct, and your document
    really does contain

    &lt;Order&gt;

    I don't get any elements. But, if I access the same url via a browser,
    the result in the browser window is something like:

    <string xmlns="http://www......">
    ~ <DataSet>
    Most likely, your browser is incorrect (or at least confusing), and
    renders &lt; as "<", even though this is not markup.
    I already browsed the web, I know it's about the escape characters, but
    I didn't find a simple solution for this.
    Not sure what "this" is. AFAICT, everything works correctly.

    Regards,
    Martin
  • Luis P. Mendes at Jan 20, 2005 at 1:05 pm
    this is the xml document:

    <?xml version="1.0" encoding="utf-8"?>
    <string xmlns="http://www......">&lt;DataSet&gt;
    ~ &lt;Order&gt;
    ~ &lt;Customer&gt;439&lt;/Customer&gt;
    (... others ...)
    ~ &lt;/Order&gt;
    &lt;/DataSet&gt;</string>

    When I do:

    print xmldoc.toxml()

    it prints:
    <?xml version="1.0" ?>
    <string xmlns="http://www...">&lt;DataSet&gt;
    ~ &lt;Order&gt;
    ~ &lt;Customer&gt;439&lt;/Customer&gt;

    ~ &lt;/Order&gt;
    &lt;/DataSet&gt;</string>

    __________________________________________________________
    with: stringNode = xmldoc.childNodes[0]
    print stringNode.toxml()
    I get:
    <string xmlns="http://www.......">&lt;DataSet&gt;
    ~ &lt;Order&gt;
    ~ &lt;Customer&gt;439&lt;/Customer&gt;

    ~ &lt;/Order&gt;
    &lt;/DataSet&gt;</string>
    ______________________________________________________________________

    with: DataSetNode = stringNode.childNodes[0]
    print DataSetNode.toxml()

    I get:

    &lt;DataSet&gt;
    ~ &lt;Order&gt;
    ~ &lt;Customer&gt;439&lt;/Customer&gt;

    ~ &lt;/Order&gt;
    &lt;/DataSet&gt;
    _______________________________________________________________-

    so far so good, but when I issue the command:

    print DataSetNode.childNodes[0]

    I get:
    IndexError: tuple index out of range

    Why the error, and why does it return a tuple?
    Why doesn't it return:
    &lt;Order&gt;
    &lt;Customer&gt;439&lt;/Customer&gt;

    &lt;/Order&gt;
    ??
  • Kent Johnson at Jan 20, 2005 at 2:01 pm

    Luis P. Mendes wrote:

    this is the xml document:

    <?xml version="1.0" encoding="utf-8"?>
    <string xmlns="http://www......">&lt;DataSet&gt;
    ~ &lt;Order&gt;
    ~ &lt;Customer&gt;439&lt;/Customer&gt;
    (... others ...)
    ~ &lt;/Order&gt;
    &lt;/DataSet&gt;</string>
    This is an XML document containing a single tag, <string>, whose content is text containing
    entity-escaped XML.

    This is *not* an XML document containing tags <DataSet>, <Order>, <Customer>, etc.

    All the behaviour you are seeing is a consequence of this. You need to unescape the contents of the
    <string> tag to be able to treat it as structured XML.

    Kent
  • Irmen de Jong at Jan 20, 2005 at 5:06 pm
    Kent Johnson wrote:
    [...]
    This is an XML document containing a single tag, <string>, whose content
    is text containing entity-escaped XML.

    This is *not* an XML document containing tags <DataSet>, <Order>,
    <Customer>, etc.

    All the behaviour you are seeing is a consequence of this. You need to
    unescape the contents of the <string> tag to be able to treat it as
    structured XML.
    The unescaping is usually done for you by the xml parser that you use.

    --Irmen
  • Kent Johnson at Jan 20, 2005 at 5:33 pm

    Irmen de Jong wrote:
    Kent Johnson wrote:
    [...]
    This is an XML document containing a single tag, <string>, whose
    content is text containing entity-escaped XML.

    This is *not* an XML document containing tags <DataSet>, <Order>,
    <Customer>, etc.

    All the behaviour you are seeing is a consequence of this. You need to
    unescape the contents of the <string> tag to be able to treat it as
    structured XML.

    The unescaping is usually done for you by the xml parser that you use.
    Yes, so if your XML contains for example
    <stuff>&lt;not a tag&gt;</stuff>

    and you parse this and ask for the *text* content of the <stuff> tag, you will get the string
    "<not a tag>"

    but it's still *not* a tag. If you try to get child elements of the <stuff> element there will be none.

    This is exactly the confusion the OP has.
    --Irmen
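
Kent's <stuff> example can be checked directly. The following is a minimal sketch, written for a current Python 3 with the standard-library xml.etree.ElementTree rather than the Python 2.3.3 setup used in the thread; the variable names are only illustrative.

```python
import xml.etree.ElementTree as ET

# The element's content is escaped text, not markup.
stuff = ET.fromstring("<stuff>&lt;not a tag&gt;</stuff>")

# The parser has already unescaped the entities in the text content...
print(repr(stuff.text))

# ...but that text is still not a tag, so <stuff> has no child elements.
print(len(stuff))
```

The text comes back with literal angle brackets, yet the child count stays at zero, which is the distinction the original poster was missing.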
  • Martin v. Löwis at Jan 20, 2005 at 5:46 pm

    Irmen de Jong wrote:
    The unescaping is usually done for you by the xml parser that you use.
    Usually, but not in this case. If you have a text that looks like
    XML, and you want to put it into an XML element, the XML file uses
    &lt; and &gt;. The XML parser unescapes that as < and >. However, it
    does not then consider the < and > as markup, and it shouldn't.

    Regards,
    Martin
  • Irmen de Jong at Jan 20, 2005 at 6:13 pm

    Martin v. Löwis wrote:
    Irmen de Jong wrote:
    The unescaping is usually done for you by the xml parser that you use.

    Usually, but not in this case. If you have a text that looks like
    XML, and you want to put it into an XML element, the XML file uses
    &lt; and &gt;. The XML parser unescapes that as < and >. However, it
    does not then consider the < and > as markup, and it shouldn't.
    That's also what I said?

    The unescaping of the XML entities in the contents of the OP's
    <string> element is done for you by the parser,
    so you will get a text node with the <,>,&,whatever in there.
    The OP probably wants to feed that to a new xml parser instance
    to process it as markup.
    Or perhaps the way the original XML document is constructed is
    flawed.

    --Irmen
  • Martin v. Löwis at Jan 20, 2005 at 6:37 pm

    Irmen de Jong wrote:
    Usually, but not in this case. If you have a text that looks like
    XML, and you want to put it into an XML element, the XML file uses
    &lt; and &gt;. The XML parser unescapes that as < and >. However, it
    does not then consider the < and > as markup, and it shouldn't.

    That's also what I said?
    You said it in response to
    All the behaviour you are seeing is a consequence of this. You need
    to unescape the contents of the <string> tag to be able to treat it
    as structured XML.
    In that context, I interpreted
    The unescaping is usually done for you by the xml parser that you
    use.
    as "The parser should have done what you want; if the parser didn't,
    that is a bug in the parser".
    The OP probably wants to feed that to a new xml parser instance
    to process it as markup.
    Or perhaps the way the original XML document is constructed is
    flawed.
    Either of these, indeed - probably the latter.

    Regards,
    Martin
  • Martin v. Löwis at Jan 20, 2005 at 5:44 pm

    Luis P. Mendes wrote:
    with: DataSetNode = stringNode.childNodes[0]
    print DataSetNode.toxml()

    I get:

    &lt;DataSet&gt;
    ~ &lt;Order&gt;
    ~ &lt;Customer&gt;439&lt;/Customer&gt;

    ~ &lt;/Order&gt;
    &lt;/DataSet&gt;
    _______________________________________________________________-

    so far so good, but when I issue the command:

    print DataSetNode.childNodes[0]

    I get:
    IndexError: tuple index out of range

    Why the error, and why does it return a tuple?
    The DataSetNode has no children, because it is not
    an Element node, but a Text node. In XML, an element
    is denoted by

    <DataSet>...</DataSet>

    and *not* by

    &lt;DataSet&gt;...&lt;/DataSet&gt;

    The latter is just a single string, represented
    in XML as a Text node. It does not give you any
    hierarchy whatsoever.

    As a text node does not have any children, its
    childNodes member is an empty tuple; indexing
    that tuple gives you an IndexError.

    Regards,
    Martin
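
Martin's explanation can be confirmed by inspecting the node type. This is a minimal sketch for a current Python 3 (the thread used 2.3.3); SAMPLE and its namespace URI are placeholders standing in for the elided document, not the real data.

```python
from xml.dom import minidom, Node

# Placeholder document: same shape as the one in the thread,
# with a made-up namespace URI.
SAMPLE = (
    '<string xmlns="http://example.invalid/ns">'
    '&lt;DataSet&gt;&lt;Order&gt;&lt;Customer&gt;439&lt;/Customer&gt;'
    '&lt;/Order&gt;&lt;/DataSet&gt;</string>'
)

xmldoc = minidom.parseString(SAMPLE)
string_node = xmldoc.documentElement
first_child = string_node.childNodes[0]

# The only child of <string> is a Text node, not an Element.
is_text = first_child.nodeType == Node.TEXT_NODE
print(is_text)

# A Text node has no children, so indexing childNodes[0] would raise IndexError.
child_count = len(first_child.childNodes)
print(child_count)
```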
  • Luis P. Mendes at Jan 20, 2005 at 7:10 pm
    I would like to thank everyone for your answers, but I'm not seeing the
    light yet!

    When I access the url via the Firefox browser and look into the source
    code, I also get:

    <?xml version="1.0" encoding="utf-8"?>
    <string xmlns="http................">&lt;DataSet&gt;
    ~ &lt;Order&gt;
    ~ &lt;Customer&gt;439&lt;/Customer&gt;
    ~ &lt;/Order&gt;
    &lt;/DataSet&gt;</string>

    should I take the contents of the string tag that is text and replace
    all '&lt' with '<' and '&gt' with '>' and then read it with xml.minidom?
    how to do it?

    or should I use another parser that accomplishes the task with no need
    to replace the escaped characters?
  • Martin v. Löwis at Jan 20, 2005 at 8:54 pm

    Luis P. Mendes wrote:
    When I access the url via the Firefox browser and look into the source
    code, I also get:

    <?xml version="1.0" encoding="utf-8"?>
    <string xmlns="http................">&lt;DataSet&gt;
    ~ &lt;Order&gt;
    ~ &lt;Customer&gt;439&lt;/Customer&gt;
    ~ &lt;/Order&gt;
    &lt;/DataSet&gt;</string>
    Please do try to understand what you are seeing. This is crucial for
    understanding what happens.

    You may have the understanding that XML can be represented as a tree.
    This would be good - if not, please read a book that explains why
    XML can be considered as a tree.

    In the tree, you have inner nodes, and leaf nodes. For example,
    the document

    <a>
    <b>Hello</b>
    <c>World</c>
    </a>

    has 5 nodes (ignoring whitespace content):

    Element:a ---- Element:b ---- Text:"Hello"
    \-- Element:c ---- Text:"World"

    So the leaf nodes are typically Text nodes (unless you
    have an empty element). Your document has this structure:

    Element:string ---- Text:"""<DataSet>
    <Order>
    <Customer>439</Customer>
    </Order>
    </DataSet>"""

    So the ***TEXT*** contains the letter "<", just like it contains
    the letters "O" and "r". There IS no element Order in your document,
    no matter how hard you look.

    If you want a DataSet *element* in your document, it should
    read

    <string xmlns="...">
    <DataSet>
    <Order>
    <Customer>439</Customer>
    </Order>
    </DataSet>
    </string>

    As this is the document you apparently want to process, complain
    to whoever gave you that other document.
    should I take the contents of the string tag that is text and replace
    all '&lt' with '<' and '&gt' with '>' and then read it with xml.minidom?
    No. We still don't know what you want to achieve, so it is difficult to
    advise you what to do. My best advice is that whoever generates the XML
    document should fix it.
    or should I use another parser that accomplishes the task with no need
    to replace the escaped characters?
    No. The parser is working correctly.

    The document you got can also be interpreted as containing another
    XML document as a text. This is evil, but apparently people are doing
    it, anyway. If you really want that embedded document, you need
    first to extract it.

    To see what I mean, do

    print DataSetNode.data

    The .data attribute gives you the string contents of
    a text node. You could use this as an XML document, and
    pass it again to an XML parser. This would be ugly,
    but might be your only choice if the producer of the
    document is unwilling to adjust.

    Regards,
    Martin
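
The "extract the text, then parse it again" route Martin describes might look like the sketch below. It assumes a current Python 3 and uses a placeholder SAMPLE string with a made-up namespace URI in place of the real response.

```python
from xml.dom import minidom

# Placeholder for the document returned by the URL in the thread.
SAMPLE = (
    '<string xmlns="http://example.invalid/ns">'
    '&lt;DataSet&gt;&lt;Order&gt;&lt;Customer&gt;439&lt;/Customer&gt;'
    '&lt;/Order&gt;&lt;/DataSet&gt;</string>'
)

# First parse: yields <string> with a single Text child.
outer = minidom.parseString(SAMPLE)
embedded_text = outer.documentElement.firstChild.data

# Second parse: the text itself is a well-formed XML document.
inner = minidom.parseString(embedded_text)
customers = inner.getElementsByTagName("Customer")
customer_id = customers[0].firstChild.data
print(customer_id)
```

No manual replace of &lt; or &gt; is needed here: the first parse already unescapes the entities when it builds the text node.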
  • Jeremy Bowers at Jan 21, 2005 at 12:01 am

    On Thu, 20 Jan 2005 21:54:30 +0100, Martin v. Löwis wrote:

    Luis P. Mendes wrote:
    When I access the url via the Firefox browser and look into the source
    code, I also get:

    <?xml version="1.0" encoding="utf-8"?> <string
    xmlns="http................">&lt;DataSet&gt; ~ &lt;Order&gt;
    ~ &lt;Customer&gt;439&lt;/Customer&gt; ~ &lt;/Order&gt;
    &lt;/DataSet&gt;</string>
    Please do try to understand what you are seeing. This is crucial for
    understanding what happens.
    From extremely painful and lengthy personal experience, Luis, I
    ***extremely*** strongly recommend taking the time to nail this down until
    you really, really, really understand what is going on. Until you can
    explain it to somebody else coherently, ideally.

    Mixing escaping levels like this absolutely, positively *must* be done
    correctly, or extremely-painful-to-debug problems will result.

    (My painful experience was layering an RPC implementation in plain text on
    top of IM messages, where I was dealing with everything from the socket
    level up except the XML parser. Ultimately it turned out there was a
    problem in the XML parser, it rendered "&amp;amp;" as "&", which is wrong
    wrong wrong. But that took a *long* time to find, especially as I had
    other bugs in the way.)

    Since you're layering XML in XML, test &amp;amp; and &amp;amp;amp; to make
    sure they work correctly; those usually show encoding errors. And, given
    your current understanding of the issue, do not write your own decoding
    function unless you absolutely can't avoid it.
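
The check Jeremy suggests can be written as a small round-trip test. This is only a sketch under assumptions: the round_trip helper and the <v> and <string> wrapper elements are hypothetical, and it relies on the standard library's escape function and parser instead of a hand-written decoder.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape


def round_trip(value):
    """Embed value in XML, embed that XML as text in another XML, read it back."""
    inner_xml = "<v>%s</v>" % escape(value)                # first escaping level
    outer_xml = "<string>%s</string>" % escape(inner_xml)  # second escaping level
    embedded = ET.fromstring(outer_xml).text               # first parse undoes one level
    return ET.fromstring(embedded).text                    # second parse undoes the other


# Values that tend to expose double-escaping or double-unescaping mistakes.
samples = ["&", "&amp;", "&amp;amp;", "a < b"]
results = [round_trip(s) for s in samples]
print(results == samples)
```

If any layer escaped or unescaped once too often, the ampersand samples would come back changed.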
  • Luis P. Mendes at Jan 21, 2005 at 5:55 pm
    From your experience, do you think that, if this wrong XML code is
    meant to be read only by some kind of Microsoft parser, the error will
    not occur?

    I'll try to explain:

    The XML producer writes the code on a Windows platform and 'thinks' that
    every client will read/parse the code with a specific Windows parser. Could
    that (wrong) XML code parse correctly in that kind of specific Windows
    client?

    Or in other words:

    Do you know any Windows parser that could turn that erroneous encoding
    into an XML tree, with four or five inner levels of tags?

    I'd like to thank everyone for taking the time to answer me.


    Luis
  • Martin v. Löwis at Jan 22, 2005 at 12:49 am

    Luis P. Mendes wrote:
    From your experience, do you think that if this wrong XML code could be
    meant to be read only by some kind of Microsoft parser, the error will
    not occur?
    This is very unlikely. MSXML would never do this incorrectly.

    Regards,
    Martin
  • Fredrik Lundh at Jan 21, 2005 at 6:08 pm

    Luis P. Mendes wrote:

    xml producer writes the code in Windows platform and 'thinks' that every
    client will read/parse the code with a specific Windows parser. Could
    that (wrong) XML code parse correctly in that kind of specific Windows
    client?
    not if it's an XML parser.
    Do you know any windows parser that could turn that erroneous encoding
    to a xml tree, with four or five inner levels of tags?
    any parser *can* do that, but I doubt many parsers will do it unless
    you ask it to (by extracting the string and parsing it again). here's the
    elementtree version:

    import urllib
    from elementtree.ElementTree import parse, XML

    wrapper = parse(urllib.urlopen(url))
    # <string> is the root element here, so read its text directly
    dataset = XML(wrapper.getroot().text)

    </F>
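
For the original goal of storing the elements as dictionaries inside a list, one possible follow-up is sketched below. It assumes a current Python 3 with the standard-library xml.etree.ElementTree; SAMPLE, its namespace URI and the second order (440) are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Placeholder response; only the first Customer value appears in the thread.
SAMPLE = (
    '<string xmlns="http://example.invalid/ns">'
    '&lt;DataSet&gt;'
    '&lt;Order&gt;&lt;Customer&gt;439&lt;/Customer&gt;&lt;/Order&gt;'
    '&lt;Order&gt;&lt;Customer&gt;440&lt;/Customer&gt;&lt;/Order&gt;'
    '&lt;/DataSet&gt;</string>'
)

wrapper = ET.fromstring(SAMPLE)        # root is the <string> wrapper
dataset = ET.fromstring(wrapper.text)  # parse the embedded document

# One dictionary per <Order>, keyed by the tag of each child element.
orders = [
    {field.tag: field.text for field in order}
    for order in dataset.findall("Order")
]
print(orders)
```

The embedded document declares no namespace of its own, so its tags can be matched by their plain names.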

Discussion Overview
group: python-list @
categories: python
posted: Jan 19, '05 at 8:02p
active: Jan 22, '05 at 12:49a
posts: 17
users: 6
website: python.org