Hello, All!

i have a 3 MB XML document with about 150000 lines (i think it has about
200000 elements) which i want to parse into a DOM to work with.
first i thought there would be no problems, but there were..
first i tried Python.. there's a special interest group that wants to "make
Python become the premier language for XML processing" so i thought there
would be no problems with this document..
i used xml.dom.minidom to parse it.. after it ate 400 megs of RAM i killed
it - i don't want that kind of processing.. i think this is because of the fat
class implementation - possibly Python has some significant overhead for each
object instance, or something like this..
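
For reference, the minidom usage being complained about boils down to a single parse call that materializes every element as a Python object; a toy sketch (modern Python 3 spelling, with an inline string standing in for the 3 MB file):

```python
from xml.dom import minidom

# parseString (like parse) builds the whole tree in memory at once;
# each element becomes a full Python instance
xml_text = "<root><item>a</item><item>b</item></root>"
doc = minidom.parseString(xml_text)
items = doc.getElementsByTagName("item")
print(len(items))  # 2
```

With ~200000 elements, that per-instance overhead is exactly where the memory goes.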

then i asdf-installed the s-xml package and tried it. it ate only 25
megs for the lxml representation. i think interning element names helped a lot..
it was CLISP, which has unicode inside, so i think it could be even less
without unicode..

then i tried C++ - TinyXML. it was fast, but ate 65 megs.. yep, looks like
interning helps a lot 8-]

then i tried Perl's XML::DOM.. it was better than Python - about 180 megs, but
it was the slowest.. at least it consumed memory more slowly than Python 8-]

and Java.. with the default parser it took 45 megs.. maybe it interned strings,
but there was overhead from classes - storing trees is definitely what
lisp is optimized for 8-]

so lisp is the winner.. but it has no standard (or even non-standard but
simple) way to write binary IEEE floating point representations, so common
lisp sucks and i will use c++ for my task.. 8-]]]

With best regards, Alex 'killer_storm' Mizrahi.


  • Hans Nowak at Jul 12, 2004 at 1:32 am

    Alex Mizrahi wrote:

    i have 3mb long XML document with about 150000 lines (i think it has about
    200000 elements there) which i want to parse to DOM to work with.
    first i thought there will be no problems, but there were..
    first i tried Python.. there's special interest group that wants to "make
    Python become the premier language for XML processing" so i thought there
    will be no problems with this document..
    i used xml.dom.minidom to parse it.. after it ate 400 meg of RAM i killed
    it - i don't want such processing.. i think this is because of that fat
    class implementation - possibly Python had some significant overhead for each
    object instance, or something like this..
    Have you tried ElementTree?

    http://effbot.org/zone/element-index.htm
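
For reference, ElementTree's API is considerably lighter than the DOM; a minimal sketch in the modern stdlib spelling (the effbot package of the day had the same shape), with an inline string in place of a file:

```python
import xml.etree.ElementTree as ET

xml_text = "<catalog><book><title>A</title></book><book><title>B</title></book></catalog>"
root = ET.fromstring(xml_text)
# iterate matching elements directly, without DOM-style node objects
titles = [book.find("title").text for book in root.iter("book")]
print(titles)  # ['A', 'B']
```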

    HTH,

    --
    Hans Nowak (hans at zephyrfalcon.org)
    http://zephyrfalcon.org/
  • Alex Mizrahi at Jul 12, 2004 at 1:53 am
    (message (Hello 'Hans)
    (you :wrote :on '(Sun, 11 Jul 2004 21:32:11 -0400))
    (
    i have 3mb long XML document with about 150000 lines (i think it has
    about 200000 elements there) which i want to parse to DOM to work
    with.
    first i thought there will be no problems, but there were..
    first i tried Python.. there's special interest group that wants to
    "make
    Python become the premier language for XML processing" so i thought
    there will be no problems with this document..
    i used xml.dom.minidom to parse it.. after it ate 400 meg of RAM i
    killed it - i don't want such processing.. i think this is because of
    that fat class implementation - possibly Python had some significant
    overhead for each object instance, or something like this..
    HN> Have you tried ElementTree?

    no..
    just tried it - it eats 133 megs and parses for quite a long time; however,
    it works..
    i'll consider using it, because processing xml in c++ appears to be a pain in
    the ass..

    )
    (With-best-regards '(Alex Mizrahi) :aka 'killer_storm)
    (prin1 "Jane dates only Lisp programmers"))
  • Peter Hansen at Jul 12, 2004 at 2:15 am

    Alex Mizrahi wrote:

    i have 3mb long XML document with about 150000 lines (i think it has about
    200000 elements there) which i want to parse to DOM to work with.
    Often, problems with performance come down to using the
    wrong algorithm, or using the wrong architecture for the
    problem at hand.

    Are you absolutely certain that using a full in-memory DOM
    representation is the best for your problem? It seems
    very unlikely to me that it really is...

    For example, there are approaches which can read in the
    document incrementally (and I'm not just talking SAX here),
    rather than read the whole thing at once.
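
One concrete example of such an incremental approach (not necessarily what Peter has in mind) is ElementTree's iterparse, which yields elements as they complete and lets you discard them immediately; a minimal sketch in the modern stdlib spelling, with an inline string in place of a file:

```python
import xml.etree.ElementTree as ET
from io import StringIO

xml_text = "<log><rec n='0'/><rec n='1'/><rec n='2'/></log>"
count = 0
for event, elem in ET.iterparse(StringIO(xml_text), events=("end",)):
    if elem.tag == "rec":
        count += 1
        elem.clear()  # drop the element's contents once processed, keeping memory flat
print(count)  # 3
```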

    I found your analysis fairly simplistic, on the whole...

    -Peter
  • Tim Bradshaw at Jul 12, 2004 at 10:44 am
    Paul Rubin <http://phr.cx at NOSPAM.invalid> wrote in message news:<7xllhpc3yx.fsf at ruckus.brouhaha.com>...
    Rather than either reading incrementally or else slurping in the
    entire document in many-noded glory, I wonder if anyone's implemented
    a parser that scans over the XML doc and makes a compact sequential
    representation of the tree structure, and then provides access methods
    that let you traverse the tree as if it were a real DOM, by fetching
    the appropriate strings from the (probably mmap'ed) disk file as you
    walk around in the tree.
    I dunno if this has been done recently, but this is the sort of thing
    that people used to do for very large SGML documents. I forget the
    details, but I remember things that were some hundreds of MB (parsed
    or unparsed I'm not sure) which would be written out in some parsed
    form into large files, which could then be manipulated as if the whole
    object was there. Of course no one would care about a few hundred MB
    of memory now, but they did then (this was 91-92 I think).

    I had a theory of doing this all lazily, so you wouldn't have to do
    the (slow) parsing step up-front but would just lie and say `OK, I
    parsed it', then actually doing the work only on demand.

    --tim
  • Alex Mizrahi at Jul 12, 2004 at 12:03 pm
    (message (Hello 'Peter)
    (you :wrote :on '(Sun, 11 Jul 2004 22:15:50 -0400))
    (
    i have 3mb long XML document with about 150000 lines (i think it has
    about 200000 elements there) which i want to parse to DOM to work
    with.
    PH> Often, problems with performance come down to using the wrong
    PH> algorithm, or using the wrong architecture for the problem at hand.

    i see nothing wrong in loading 3 MB of data into RAM. however, implementation
    details made it 100 times larger, and that was the problem..

    PH> Are you absolutely certain that using a full in-memory DOM
    PH> representation is the best for your problem? It seems very unlikely
    PH> to me that it really is...

    the format i'm dealing with is quite chaotic and i'm going to work with it
    interactively - track down myself where the data i need lies and see how i can
    extract it..
    it's only a small part of the task and it's needed only temporarily, so i don't
    need the best thing possible - i need something that just works..

    )
    (With-best-regards '(Alex Mizrahi) :aka 'killer_storm)
    (prin1 "Jane dates only Lisp programmers"))
  • Peter Hansen at Jul 12, 2004 at 1:17 pm

    Alex Mizrahi wrote:

    (message (Hello 'Peter)
    i have 3mb long XML document with about 150000 lines (i think it has
    about 200000 elements there) which i want to parse to DOM to work
    with.
    PH> Often, problems with performance come down to using the wrong
    PH> algorithm, or using the wrong architecture for the problem at hand.

    i see nothing wrong in loading 3 MB of data into RAM. however, implementation
    details made it 100 times larger, and that was the problem..
    What is problematic about that for you?
    PH> Are you absolutely certain that using a full in-memory DOM
    PH> representation is the best for your problem? It seems very unlikely
    PH> to me that it really is...

    format i'm dealing with is quite chaotic and i'm going to work with it
    interactively - track down myself where data i need lie and see how can i
    extract data..
    You didn't mention this before. If you're doing it interactively,
    which I assume means with you actually typing lines of code that
    will be executed in real-time, as you hit ENTER, then why the heck
    are you concerned about the RAM footprint (i.e. there's nothing wrong
    with loading 100MB of data into RAM for such a case either) or
    even performance (since clearly you are going to be spending many
    times more time working with the data than it takes to parse it,
    even with some of the slower methods)?

    Like I said, pick the right architecture for your problem domain...

    -Peter
  • Alex Mizrahi at Jul 12, 2004 at 3:43 pm
    (message (Hello 'Peter)
    (you :wrote :on '(Mon, 12 Jul 2004 09:17:58 -0400))
    (
    i see nothing wrong in loading 3 mb data into RAM. however,
    implementation details made it 100 times larger and it was the
    problem..
    PH> What is problematic about that for you?

    i have only 500 MB of RAM. i killed the parser when it had eaten 400 MB.. i don't
    know for sure how much it needs, but it's too much anyway..

    )
    (With-best-regards '(Alex Mizrahi) :aka 'killer_storm)
    (prin1 "Jane dates only Lisp programmers"))
  • Alex Mizrahi at Jul 12, 2004 at 12:05 pm
    (message (Hello 'Paul)
    (you :wrote :on '(11 Jul 2004 20:13:42 -0700))
    (
    For example, there are approaches which can read in the document
    incrementally (and I'm not just talking SAX here), rather than read
    the whole thing at once.
    PR> Rather than either reading incrementally or else slurping in the
    PR> entire document in many-noded glory, I wonder if anyone's
    PR> implemented a parser that scans over the XML doc and makes a compact
    PR> sequential representation of the tree structure, and then provides
    PR> access methods that let you traverse the tree as if it were a real
    PR> DOM, by fetching the appropriate strings from the (probably mmap'ed)
    PR> disk file as you walk around in the tree.

    that would be nice.. i remember i did something like this for one binary
    chunky format - the thingie avoided allocating new memory as long as possible..

    )
    (With-best-regards '(Alex Mizrahi) :aka 'killer_storm)
    (prin1 "Jane dates only Lisp programmers"))
  • Richie Hindle at Jul 12, 2004 at 12:32 pm
    [Paul]
    Rather than either reading incrementally or else slurping in the
    entire document in many-noded glory, I wonder if anyone's implemented
    a parser that scans over the XML doc and makes a compact sequential
    representation of the tree structure, and then provides access methods
    that let you traverse the tree as if it were a real DOM, by fetching
    the appropriate strings from the (probably mmap'ed) disk file as you
    walk around in the tree.
    It's not exactly what you describe here, but xml.dom.pulldom is roughly
    this. You access the higher-level nodes by SAX, thus not using massive
    amounts of memory, but you can access the children of those higher-level
    elements using DOM. The canonical example is processing a large number of
    XML records - the XML document is arbitrarily large but the individual
    records aren't. Pulldom passes each record to you SAX-style, and you use
    DOM to process the record.

    Uche Ogbuji has a short article on xml.dom.pulldom here:
    http://www-106.ibm.com/developerworks/xml/library/x-tipulldom.html

    Here's the example from that article. Line 16 is the key to it - that's the
    point at which you switch from SAX to DOM:

     1  #Get the first line in Act IV, scene II
     2
     3  from xml.dom import pulldom
     4
     5  hamlet_file = open("hamlet.xml")
     6
     7  events = pulldom.parse(hamlet_file)
     8  act_counter = 0
     9  for (event, node) in events:
    10      if event == pulldom.START_ELEMENT:
    11          if node.tagName == "ACT":
    12              act_counter += 1
    13              scene_counter = 1
    14          if node.tagName == "SCENE":
    15              if act_counter == 4 and scene_counter == 2:
    16                  events.expandNode(node)
    17                  #Traditional DOM processing starts here
    18                  #Get all descendant elements named "LINE"
    19                  line_nodes = node.getElementsByTagName("LINE")
    20                  #Print the text data of the text node
    21                  #of the first LINE element
    22                  print line_nodes[0].firstChild.data
    23              scene_counter += 1

    --
    Richie Hindle
    richie at entrian.com
  • Cameron Laird at Jul 12, 2004 at 4:06 pm
    In article <7xllhpc3yx.fsf at ruckus.brouhaha.com>,
    Paul Rubin wrote:
    Peter Hansen <peter at engcorp.com> writes:
    For example, there are approaches which can read in the
    document incrementally (and I'm not just talking SAX here),
    rather than read the whole thing at once.
    Rather than either reading incrementally or else slurping in the
    entire document in many-noded glory, I wonder if anyone's implemented
    a parser that scans over the XML doc and makes a compact sequential
    representation of the tree structure, and then provides access methods
    that let you traverse the tree as if it were a real DOM, by fetching
    the appropriate strings from the (probably mmap'ed) disk file as you
    walk around in the tree.
    While I don't yet follow all the places this thread has gone,
    tDOM <URL: http://wiki.tcl.tk/tdom > is where I turn when I
    want *fast* DOMish handling. Although its author favors Tcl,
    there's no particular reason not to use it with Python.
  • John J. Lee at Jul 12, 2004 at 6:56 pm
    claird at lairds.us (Cameron Laird) writes:
    [...]
    While I don't yet follow all the places this thread has gone,
    tDOM <URL: http://wiki.tcl.tk/tdom > is where I turn when I
    want *fast* DOMish handling. Although its author favors Tcl,
    there's no particular reason not to use it with Python.
    How would one use it with Python?


    John
  • John Lenton at Jul 12, 2004 at 10:42 am

    On Mon, 12 Jul 2004 03:19:03 +0300, Alex Mizrahi wrote:
    Hello, All!

    i have 3mb long XML document with about 150000 lines (i think it has about
    200000 elements there) which i want to parse to DOM to work with.
    first i thought there will be no problems, but there were..
    first i tried Python.. there's special interest group that wants to "make
    Python become the premier language for XML processing" so i thought there
    will be no problems with this document..
    i used xml.dom.minidom to parse it.. after it ate 400 meg of RAM i killed
    it - i don't want such processing.. i think this is because of that fat
    class implementation - possibly Python had some significant overhead for each
    object instance, or something like this..
    in my experience xml.dom.minidom is a hog; there are several other DOM
    and tree-building (i.e., DOMish) parsers out there, most of them
    better from a performance point of view; my own personal favourite is
    libxml2, but google for the issue and you'll even come across people
    who have compared the alternatives.

    --
    John Lenton (jlenton at gmail.com) -- Random fortune:
    bash: fortune: command not found
  • R. Mattes at Jul 12, 2004 at 12:43 pm

    On Mon, 12 Jul 2004 02:19:03 +0200, Alex Mizrahi wrote:

    Hello, All!

    i have 3mb long XML document with about 150000 lines (i think it has
    about 200000 elements there) which i want to parse to DOM to work with.
    first i thought there will be no problems, but there were.. first i
    tried Python.. there's special interest group that wants to "make Python
    become the premier language for XML processing" so i thought there will
    be no problems with this document.. i used xml.dom.minidom to parse it..
    after it ate 400 meg of RAM i killed it - i don't want such processing..
    i think this is because of that fat class implementation - possibly
    Python had some significant overhead for each object instance, or
    something like this..
    First of all: which parser did you actually use? There are quite a number
    of XML parsers for Python. I personally use the libxml2 one and have never had
    memory problems like you describe.
    then i asdf-installed s-xml package and tried it with it. it ate only 25
    megs for lxml representation. i think interning element names helped a
    lot.. it was CLISP that has unicode inside, so i think it could be even
    less without unicode..
    Hmmm ... i guess you know that you're comparing apples with pears?
    S-XML is a nice, small parser, but nowhere near a standards-conformant
    XML parser. Have a closer look at the webpage: no handling of character
    encoding, no CDATA, can't handle what the author calls "special tags"
    (like processing instructions), no schema/DTD support, and, most
    important, no namespace support!
    then i tried C++ - TinyXML. it was fast, but ate 65 megs.. ye, looks
    like interning helps a lot 8-]
    Interning is _much_ easier without namespaces.
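
For the record, "interning" here means keeping one shared copy of each distinct name, so that 200000 repeated element names cost a handful of string objects instead of 200000; in modern Python this is exposed as sys.intern (it was the builtin intern in the 2.x of this thread's era):

```python
import sys

# two distinct string objects with equal contents...
a = "".join(["chap", "ter"])
b = "chapter"
# ...interning maps both to a single shared object
print(sys.intern(a) is sys.intern(b))  # True
```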
    then i tried Perl XML::DOM.. it was better than python - about 180megs,
    but it was slowest.. at least it consumed mem slower than python 8-]

    and java.. with default parser it took 45mbs.. maybe it interned
    strings, but there was overhead from classes - storing trees is
    definitely what's lisp optimized for 8-]
    But you never got to a _full_ DOM with your lxml parsing. What you got was
    a list-of-lists. There's no 'parent' implementation for your lxml
    elements (which means that you might need to pass the whole thing
    around all the time).

    If you want a serious comparison, you either need to compare s-xml with
    similar "lightweight" parsers in Perl/Python/Ruby etc. or write your own
    fully DOM-compliant parser in LISP (or is there one already? I'm still
    looking for a good one).

    Just my 0.02 $

    Ralf Mattes

    so lisp is the winner.. but it has no standard (or even non-standard
    but simple) way to write binary IEEE floating point representations, so
    common lisp sucks and i will use c++ for my task.. 8-]]]

    With best regards, Alex 'killer_storm' Mizrahi.
  • Peter Hansen at Jul 12, 2004 at 1:20 pm

    R. Mattes wrote:

    But you never got to a _full_ DOM with you lxml parsing. What you got was
    a list-of-lists. There's no 'parent' implementation for your lxml
    elements (which means that you might need to path the whole thing
    arround all the time).

    If you want a serious comparison you either need to compare s-xml with
    similar "lightweight" parsers in Perl/Python/Ruby etc.
    If that's what he wants, it would be called PyRXP in the Python world,
    a Python wrapper around the RXP library, available from
    http://www.reportlab.com/ . Has a much smaller footprint than the
    DOM representations, as you would expect, and lightning fast.

    -Peter
  • Frank Buss at Jul 12, 2004 at 1:04 pm

    "Alex Mizrahi" wrote:

    so lisp is the winner.. but it has no standard (or even non-standard
    but simple) way to write binary IEEE floating point representations, so
    common lisp sucks and i will use c++ for my task.. 8-]]]
    This is a silly reason. You could integrate a C program for this if you
    like, but Common Lisp has the function INTEGER-DECODE-FLOAT, which you can
    use for this; see for example this listing:

    http://www.ai.mit.edu/people/cvince/date/src/code/java/serialization.lisp

    But you don't need to use this somewhat complicated function, because CLISP has
    the non-standard function EXT:WRITE-FLOAT, which "writes a floating-point
    number in IEEE 754 binary representation":

    http://clisp.cons.org/impnotes/stream-dict.html#bin-output

    I assume other Lisp implementations have something similar.
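
For comparison, on the Python side of this crossposted thread the standard struct module writes IEEE 754 representations directly:

```python
import struct

packed = struct.pack(">f", 1.5)   # big-endian IEEE 754 single precision
print(packed.hex())               # 3fc00000
(value,) = struct.unpack(">f", packed)
print(value)                      # 1.5
```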
  • Cameron Laird at Jul 12, 2004 at 5:06 pm
    In article <2le3nlFb82reU1 at uni-berlin.de>,
    Alex Mizrahi wrote:
    [...]
    so lisp is the winner.. but it has no standard (or even non-standard but
    simple) way to write binary IEEE floating point representations, so common
    lisp sucks and i will use c++ for my task.. 8-]]]
    [...]
    I'm having trouble following this thread. In isolation, though,
    this claim *can't* be true. Is there a particular Lisp
    implementation you have in mind?
  • Uche Ogbuji at Jul 16, 2004 at 4:40 pm
    "Alex Mizrahi" <udodenko at hotmail.com> wrote in message news:<2le3nlFb82reU1 at uni-berlin.de>...
    Hello, All!

    i have 3mb long XML document with about 150000 lines (i think it has about
    200000 elements there) which i want to parse to DOM to work with.
    first i thought there will be no problems, but there were..
    first i tried Python.. there's special interest group that wants to "make
    Python become the premier language for XML processing" so i thought there
    will be no problems with this document..
    i used xml.dom.minidom to parse it.. after it ate 400 meg of RAM i killed
    it - i don't want such processing.. i think this is because of that fat
    class impementation - possibly Python had some significant overhead for each
    object instance, or something like this..
    minidom has about a 60X - 80X load factor on average (comparing XML
    file size to memory working set). You're claiming you saw a 130X load
    factor. That sounds odd. Are there special characteristics of your
    document you're not mentioning?

    cDomlette, part of 4Suite, only has a 10X load factor, so I'd guess
    your example would end up with a 30MB memory working set. cDomlette
    does use string interning, as one example of optimization techniques.
    4Suite also provides you XSLT, XPath, RELAX NG and some other
    processing goodies.

    See:

    http://4suite.org/
    http://www.xml.com/pub/a/2002/10/16/py-xml.html
    http://uche.ogbuji.net/akara/nodes/2003-01-01/domlettes?xslt=/akara/akara.xslt

    As you can see, cDomlette is as DOM-like as minidom, so very easy to
    use (for the record neither is a compliant W3C DOM implementation).

    Also, in my next XML.com article I cover a technique that uses SAX to
    break large XML documents into series of small DOMs, one after the
    other, so that the memory penalty is *very* low, depending on your
    document structure. It works with any DOM implementation that meets
    the Python DOM binding, including minidom and cDomlette.

Discussion Overview
group: python-list @ python.org
categories: python
posted: Jul 12, '04 at 12:19a
active: Jul 16, '04 at 4:40p
posts: 18
users: 11
