I've been working on a way to parse an XML document and convert it into a Python dictionary. I want to maintain the hierarchy of the XML. Here is the sample XML I have been working on:

<collection>
  <comic title="Sandman" number='62'>
    <writer>Neil Gaiman</writer>
    <penciller pages='1-9,18-24'>Glyn Dillon</penciller>
    <penciller pages="10-17">Charles Vess</penciller>
  </comic>
</collection>

This is my first stab at this:

#!/usr/bin/env python

from lxml import etree

def generateKey(element):
    if element.attrib:
        key = (element.tag, element.attrib)
    else:
        key = element.tag
    return key

class parseXML(object):
    def __init__(self, xmlFile = 'test.xml'):
        self.xmlFile = xmlFile

    def parse(self):
        doc = etree.parse(self.xmlFile)
        root = doc.getroot()
        key = generateKey(root)
        dictA = {}
        for r in root.getchildren():
            keyR = generateKey(r)
            if r.text:
                dictA[keyR] = r.text
            if r.getchildren():
                dictA[keyR] = r.getchildren()

        newDict = {}
        newDict[key] = dictA
        return newDict

if __name__ == "__main__":
    px = parseXML()
    newDict = px.parse()
    print newDict

This is the output:
163>./parseXML.py
{'collection': {('comic', {'number': '62', 'title': 'Sandman'}): [<Element writer at -482193f4>, <Element penciller at -482193cc>, <Element penciller at -482193a4>]}}

The script doesn't descend all of the way down because I'm not sure how to handle an XML document that may have multiple layers. Any advice? Would this be a job for recursion?

Thanks!


  • Alan Gauld at Nov 14, 2009 at 8:50 am
    "Christopher Spears" <cspears2002 at yahoo.com> wrote
    > I've been working on a way to parse an XML document and
    > convert it into a python dictionary. I want to maintain the hierarchy of
    > the XML.
    > Here is the sample XML I have been working on:
    >
    > <collection>
    >   <comic title="Sandman" number='62'>
    >     <writer>Neil Gaiman</writer>
    >     <penciller pages='1-9,18-24'>Glyn Dillon</penciller>
    >     <penciller pages="10-17">Charles Vess</penciller>
    >   </comic>
    > </collection>
    >
    > This is my first stab at this:
    >
    > #!/usr/bin/env python
    >
    > from lxml import etree
    >
    > def generateKey(element):
    >     if element.attrib:
    >         key = (element.tag, element.attrib)
    >     else:
    >         key = element.tag
    >     return key

    So how are you handling multiple identical tags? It looks from your code
    as if you will replace the content of the previous tag with the content of
    the last tag found. I would expect your keys to have some reference to
    either the parse depth or a sequence count. In your sample XML the
    problem never arises and maybe in your real data it will never happen
    either, but in the general case it is quite common for the same tag
    and attribute pair to be used multiple times in a document.

    > class parseXML(object):
    >     def __init__(self, xmlFile = 'test.xml'):
    >         self.xmlFile = xmlFile
    >
    >     def parse(self):
    >         doc = etree.parse(self.xmlFile)
    >         root = doc.getroot()
    >         key = generateKey(root)
    >         dictA = {}
    >         for r in root.getchildren():
    >             keyR = generateKey(r)
    >             if r.text:
    >                 dictA[keyR] = r.text
    >             if r.getchildren():
    >                 dictA[keyR] = r.getchildren()
    >
    > The script doesn't descend all of the way down because I'm
    > not sure how to hand a XML document that may have multiple layers.
    > Advice anyone? Would this be a job for recursion?

    Recursion is the classic way to deal with tree structures, so
    yes, you could use it here, provided your tree never exceeds
    Python's recursion depth limit (I think it's still 1000 levels).

    I'm not sure how converting etree's tree structure into a dictionary
    will help you however. It seems like a lot of work for a small gain.
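
A minimal sketch of the recursive approach discussed here, written for Python 3. It uses the standard library's xml.etree.ElementTree so it runs without lxml; the same attributes (tag, attrib, text, iteration over children) are also available on lxml.etree elements. The function name element_to_dict and the dictionary layout are assumptions made for illustration, not something from the thread. Children are kept in a list, so repeated tags such as the two <penciller> elements do not overwrite each other.

```python
import xml.etree.ElementTree as ET

SAMPLE = """<collection>
  <comic title="Sandman" number="62">
    <writer>Neil Gaiman</writer>
    <penciller pages="1-9,18-24">Glyn Dillon</penciller>
    <penciller pages="10-17">Charles Vess</penciller>
  </comic>
</collection>"""


def element_to_dict(element):
    """Recursively convert an element into a plain dict.

    Hypothetical layout: 'tag', 'attrib', 'text' and 'children'.
    'children' is a list, so identical sibling tags are all preserved.
    """
    return {
        "tag": element.tag,
        "attrib": dict(element.attrib),
        "text": (element.text or "").strip(),
        # The recursive step: each child is converted the same way.
        "children": [element_to_dict(child) for child in element],
    }


root = ET.fromstring(SAMPLE)
tree = element_to_dict(root)
comic = tree["children"][0]
pencillers = [c["text"] for c in comic["children"] if c["tag"] == "penciller"]
print(comic["attrib"]["title"], pencillers)
```

The recursion depth equals the nesting depth of the XML, so the interpreter's limit (reported by sys.getrecursionlimit(), typically 1000 in CPython) only matters for very deeply nested documents.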

    hth,

    --
    Alan Gauld
    Author of the Learn to Program web site
    http://www.alan-g.me.uk/
  • Kent Johnson at Nov 14, 2009 at 1:03 pm

    On Sat, Nov 14, 2009 at 1:14 AM, Christopher Spears wrote:
    > I've been working on a way to parse an XML document and convert it into a python dictionary. I want to maintain the hierarchy of the XML. Here is the sample XML I have been working on:
    >
    > <collection>
    >   <comic title="Sandman" number='62'>
    >     <writer>Neil Gaiman</writer>
    >     <penciller pages='1-9,18-24'>Glyn Dillon</penciller>
    >     <penciller pages="10-17">Charles Vess</penciller>
    >   </comic>
    > </collection>
    >
    > This is the output:
    > 163>./parseXML.py
    > {'collection': {('comic', {'number': '62', 'title': 'Sandman'}): [<Element writer at -482193f4>, <Element penciller at -482193cc>, <Element penciller at -482193a4>]}}

    This seems an odd format. How are you going to use it? How is this
    better than the native ElementTree structure?

    > The script doesn't descend all of the way down because I'm not sure how to hand a XML document that may have multiple layers. Advice anyone? Would this be a job for recursion?

    Yes. Here is an example that might be helpful:
    http://code.activestate.com/recipes/410469/

    Kent
  • Christopher Spears at Nov 14, 2009 at 6:47 pm
    Thanks! I have a lot of XML files at work that users search through. I want to parse the XML into a Python dictionary and then read the dictionary into a database that users can use to search through the thousands of files.

    Basically, the user would submit a query like "Neil Gaiman" and then the program would return the names of the files in which the words "Neil Gaiman" appear.

    I thought I might be able to use the tags to speed up the search. For example, maybe the program will only look at the "writer" tags, or I can ask the program to show me everything under the "comic" tag.
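
As a rough illustration of the tag-based lookups described above, here is a sketch (Python 3) that queries the parsed tree directly with the standard library's xml.etree.ElementTree, without converting it to a dictionary first. The helper names writers, contains_text and under_comic are hypothetical, and the sample document is inlined so the example is self-contained.

```python
import xml.etree.ElementTree as ET

SAMPLE = """<collection>
  <comic title="Sandman" number="62">
    <writer>Neil Gaiman</writer>
    <penciller pages="1-9,18-24">Glyn Dillon</penciller>
    <penciller pages="10-17">Charles Vess</penciller>
  </comic>
</collection>"""


def writers(root):
    """Return the text of every <writer> element, at any depth."""
    return [w.text for w in root.iter("writer")]


def contains_text(root, needle):
    """True if the needle appears in any text node of the document."""
    return any(needle in text for text in root.itertext())


def under_comic(root):
    """Return (tag, attributes, text) for everything directly under each <comic>."""
    return [
        (child.tag, dict(child.attrib), child.text)
        for comic in root.iter("comic")
        for child in comic
    ]


root = ET.fromstring(SAMPLE)
print(writers(root))
print(contains_text(root, "Neil Gaiman"))
print(len(under_comic(root)))
```

For a search over thousands of files, a loop like this would have to re-parse every file on each query, which is one reason to store the extracted data somewhere persistent.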


    --- On Sat, 11/14/09, Kent Johnson wrote:
    From: Kent Johnson <kent37 at tds.net>
    Subject: Re: [Tutor] parsing XML into a python dictionary
    To: "Christopher Spears" <cspears2002 at yahoo.com>
    Cc: tutor at python.org
    Date: Saturday, November 14, 2009, 5:03 AM
  • Stefan Behnel at Nov 15, 2009 at 1:45 pm

    Christopher Spears, 14.11.2009 19:47:
    > Thanks! I have a lot of XML files at work that users search through. I
    > want to parse the XML into a python dictionary and then read the dictionary
    > into a database that users can use to search through the thousands of files.

    I think "database" is the right keyword here. Depending on how large your
    "thousands of files" are and what the actual content of each file is, a
    full-text search engine (e.g. pylucene) or an XML database might be the
    right tool, instead of trying to write something up yourself.

    If you want to use something that's in Python's standard library, consider
    parsing the XML files as a stream instead of a document tree (look for the
    iterparse() function in lxml.etree or the xml.etree.ElementTree package),
    and save the extracted data into a sqlite3 database.
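
A minimal sketch of that suggestion, assuming Python 3 and only the standard library: iterparse() streams the document and each element's tag and text go into a sqlite3 table. The table name entries, its columns, and the function index_file are made up for this example; a real schema would depend on what the users need to search.

```python
import io
import sqlite3
import xml.etree.ElementTree as ET

SAMPLE = b"""<collection>
  <comic title="Sandman" number="62">
    <writer>Neil Gaiman</writer>
    <penciller pages="1-9,18-24">Glyn Dillon</penciller>
    <penciller pages="10-17">Charles Vess</penciller>
  </comic>
</collection>"""


def index_file(conn, filename, source):
    """Stream one XML source and store (filename, tag, content) rows."""
    rows = []
    for _event, elem in ET.iterparse(source, events=("end",)):
        content = (elem.text or "").strip()
        if content:
            rows.append((filename, elem.tag, content))
        elem.clear()  # release the element's data once it has been recorded
    conn.executemany("INSERT INTO entries VALUES (?, ?, ?)", rows)
    conn.commit()


conn = sqlite3.connect(":memory:")  # use a file path to persist the index
conn.execute("CREATE TABLE entries (filename TEXT, tag TEXT, content TEXT)")
index_file(conn, "sandman.xml", io.BytesIO(SAMPLE))

# Which files have a <writer> element containing exactly "Neil Gaiman"?
hits = conn.execute(
    "SELECT DISTINCT filename FROM entries WHERE tag = ? AND content = ?",
    ("writer", "Neil Gaiman"),
).fetchall()
print(hits)
```

In practice index_file would be called with an open file or a path for each XML file, and the query could be restricted to a tag or left open to search all text.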

    You can also use such a database as a kind of cache that stores relevant
    information for each file, and update that information whenever you notice
    that a file has been modified.
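
One possible way to notice that a file has been modified is to store its modification time next to the indexed data and compare it on each run. The sketch below (Python 3, standard library only) assumes a hypothetical files table and the helper names needs_reindex and mark_indexed; comparing modification times is a choice made for this example, not something prescribed in the thread.

```python
import os
import sqlite3
import tempfile


def needs_reindex(conn, path):
    """True if the file is unknown or its mtime differs from the stored one."""
    row = conn.execute(
        "SELECT mtime FROM files WHERE path = ?", (path,)
    ).fetchone()
    return row is None or row[0] != os.path.getmtime(path)


def mark_indexed(conn, path):
    """Record the file's current mtime after (re)indexing it."""
    conn.execute(
        "INSERT OR REPLACE INTO files (path, mtime) VALUES (?, ?)",
        (path, os.path.getmtime(path)),
    )
    conn.commit()


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE files (path TEXT PRIMARY KEY, mtime REAL)")

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "sandman.xml")
    with open(path, "w") as fh:
        fh.write("<collection/>")

    first = needs_reindex(conn, path)    # never seen before
    mark_indexed(conn, path)
    second = needs_reindex(conn, path)   # unchanged since indexing
    # Simulate a later modification by setting a different mtime.
    os.utime(path, (1_000_000_000, 1_000_000_000))
    third = needs_reindex(conn, path)

print(first, second, third)
```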

    Stefan

Discussion Overview
group: tutor @
categories: python
posted: Nov 14, '09 at 6:14a
active: Nov 15, '09 at 1:45p
posts: 5
users: 4
website: python.org
