Hi Python Tutor folks

This is a rather long post, but I wanted to include all the details and
everything I have tried so far, so please bear with me and read the whole
thing.

Goal: I am trying to parse a ginormous (~1 GB) XML file.

I am looking for a specific element; there are several tens or hundreds of
occurrences of that element in the 1 GB file.

I need to detect them, and then for each one copy all the content between
the element's start and end tags and write it out as a smaller XML file.


0. I am a Python & XML newbie and have been relying on the excellent beginner
book Dive Into Python 3 by Mark Pilgrim. (Mark, if you are reading this, you
are AWESOME, and so is your witty & humorous writing style.)

My hardware setup: I have a Windows 7 Pro box with 8 GB of RAM and an Intel
Core 2 Quad Q9400 CPU.
On this I am running Sun VirtualBox (3.2.12), with Ubuntu 10.10 (Maverick) as
the guest OS, with 23 GB of disk space and 2 GB (2048 MB) of RAM assigned to
the guest Ubuntu OS.

1. Almost all examples of parsing XML in Python that I have seen start off
with these four lines of code:

import xml.etree.ElementTree as etree

tree = etree.parse('*path_to_ginormous_xml*')  # builds the whole tree in memory
root = tree.getroot()  # my huge xml has 1 root element at the top level
print(root)

2. In the second line of code above, as Mark explains in DIP3, the parse
function builds and returns a tree object in memory (RAM) which represents
the entire document.
I tried this code: it works fine for a small (~1 MB) file, but when I run
these four lines in a terminal against my HUGE target file (1 GB), nothing
happens.
In a separate terminal I run the top command, and I can see a Python process
whose memory (the VIRT column) increases from 100 MB all the way up to
2100 MB.

I am guessing that, as this happens (over the course of 20-30 minutes), the
tree representing the document is slowly being built in memory, but even
after 30-40 minutes nothing happens.
I don't get an error, a seg fault, or an out-of-memory exception.

3. I also tried using lxml, but an lxml tree is much more expensive, as it
retains more information about a node's context, including references to its
parent.

[http://www.ibm.com/developerworks/xml/library/x-hiperfparse/]

When I ran the same four-line code above, but with lxml's ElementTree (using
the import below in place of line 1 of the code above):

import lxml.etree as lxml_etree

I can see the memory consumption of the Python process running the code shoot
up to ~2700 MB, and then Python (or the OS?) kills the process as it nears
the total system memory (2 GB).

I ran the code from one terminal window (screenshot:
http://imgur.com/ozLkB.png)
and ran top from another terminal (http://imgur.com/HAoHA.png).

4. I then investigated some streaming libraries, but am confused: there is
SAX [http://en.wikipedia.org/wiki/Simple_API_for_XML], the iterparse
interface [http://effbot.org/zone/element-iterparse.htm], and several other
options (minidom).

Which one is the best for my situation?

Should I instead just open the file and use regular expressions to look for
the element I need?


Any and all code snippets, wisdom, thoughts, ideas, suggestions, feedback, or
comments from the Python Tutor community would be greatly appreciated.
Please feel free to email me directly too.

thanks a ton

cheers
ashish

email :
ashish.makani
domain:gmail.com

p.s.
Other useful links on XML parsing in Python:
0. http://diveintopython3.org/xml.html
1. http://stackoverflow.com/questions/1513592/python-is-there-an-xml-parser-implemented-as-a-generator
2. http://codespeak.net/lxml/tutorial.html
3. https://groups.google.com/forum/?hl=en&lnk=gst&q=parsing+a+huge+xml#!topic/comp.lang.python/CMgToEnjZBk
4. http://www.ibm.com/developerworks/xml/library/x-hiperfparse/
5. http://effbot.org/zone/element-index.htm
   http://effbot.org/zone/element-iterparse.htm
6. SAX: http://en.wikipedia.org/wiki/Simple_API_for_XML


  • Luke Paireepinart at Dec 20, 2010 at 8:42 pm
    If you can assume a well-formatted file, I would just parse it linearly; it should be much faster. Read the file in as lines if the XML is already in human-readable form, or just read in blocks, append to a list, and do a join() when you have a whole match.
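
    A rough sketch of the line-by-line scan described above (assuming the target element always starts and ends on its own lines, never nests inside itself, and using the placeholder names 'huge.xml' and 'record'):

    count = 0
    buf = []
    inside = False
    with open('huge.xml') as src:
        for line in src:
            if '<record' in line:         # start tag seen: begin buffering
                inside = True
            if inside:
                buf.append(line)
            if '</record>' in line:       # end tag seen: dump the buffered chunk
                inside = False
                count += 1
                with open('record_%d.xml' % count, 'w') as out:
                    out.write(''.join(buf))
                buf = []
    print(count, 'elements copied out')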

    -----------------------------
    Sent from a mobile device with a bad e-mail client.
    -----------------------------
  • Steven D'Aprano at Dec 20, 2010 at 9:19 pm

    ashish makani wrote:

    Goal : I am trying to parse a ginormous ( ~ 1gb) xml file.
    I sympathize with you. I wonder who thought that building a 1GB XML file
    was a good thing.

    Forget about using any XML parser that reads the entire file into
    memory. By the time that 1GB of text is read and parsed, you will
    probably have something about 6-8GB (estimated) in size.


    [...]
    My hardware setup : I have a win7 pro box with 8gb of RAM & intel core2 quad
    cpuq9400.
    In order to access 8GB of RAM, you'll be running a 64-bit OS, correct?
    In that case, you should expect the memory usage of the XML object to
    roughly double, to an estimated 12-16GB.

    I am guessing, as this happens (over the course of 20-30 mins), the tree
    representing is being slowly built in memory, but even after 30-40 mins,
    nothing happens.
    It's probably not finished. Leave it another hour or so and you'll get
    an out of memory error.

    4. I then investigated some streaming libraries, but am confused - there is
    SAX[http://en.wikipedia.org/wiki/Simple_API_for_XML] , the iterparse
    interface[http://effbot.org/zone/element-iterparse.htm], & several otehr
    options ( minidom)

    Which one is the best for my situation ?
    You absolutely need to use a streaming library. element-iterparse still
    builds the tree, so that's no use to you. I believe you should use SAX
    or minidom, but that's about the limit of my knowledge of streaming XML parsers.

    Should i instead just open the file, & use reg ex to look for the element i
    need ?
    That's likely to need less memory than building a parse tree, but still
    a huge amount of memory. And you don't know how complex the XML is; in
    general you *can't* correctly parse arbitrary XML with regular
    expressions (although you can for simple examples). Stick with the right
    tool for the job, the streaming XML library.


    --
    Steven
  • Brett Ritter at Dec 20, 2010 at 9:40 pm

    On Mon, Dec 20, 2010 at 4:19 PM, Steven D'Aprano wrote:
    Goal : I am trying to parse a ginormous ( ~ 1gb) xml file.
    I sympathize with you. I wonder who thought that building a 1GB XML file was
    a good thing.
    XML is like violence: if it isn't working, try more.

    --
    Brett Ritter / SwiftOne
    swiftone at swiftone.org
  • Sithembewena Lloyd Dube at Dec 20, 2010 at 9:50 pm
    Brett, that was very mischievous.

    I wish I could help - I am watching this thread with great curiosity; I could
    learn something from it myself.
    On Mon, Dec 20, 2010 at 11:40 PM, Brett Ritter wrote:
    On Mon, Dec 20, 2010 at 4:19 PM, Steven D'Aprano wrote:
    Goal : I am trying to parse a ginormous ( ~ 1gb) xml file.
    I sympathize with you. I wonder who thought that building a 1GB XML file was
    a good thing.
    XML is like violence: if it isn't working, try more.

    --
    Brett Ritter / SwiftOne
    swiftone at swiftone.org


    --
    Regards,
    Sithembewena Lloyd Dube
    -------------- next part --------------
    An HTML attachment was scrubbed...
    URL: <http://mail.python.org/pipermail/tutor/attachments/20101220/f62281fd/attachment.html>
    -------------- next part --------------
    A non-text attachment was scrubbed...
    Name: 338.gif
    Type: image/gif
    Size: 541 bytes
    Desc: not available
    URL: <http://mail.python.org/pipermail/tutor/attachments/20101220/f62281fd/attachment.gif>
  • Steven D'Aprano at Dec 20, 2010 at 10:32 pm

    Brett Ritter wrote:
    On Mon, Dec 20, 2010 at 4:19 PM, Steven D'Aprano wrote:
    Goal : I am trying to parse a ginormous ( ~ 1gb) xml file.
    I sympathize with you. I wonder who thought that building a 1GB XML file was
    a good thing.
    XML is like violence: if it isn't working, try more.
    I love it -- may I quote you?


    --
    Steven
  • Brett Ritter at Dec 21, 2010 at 12:37 am

    On Mon, Dec 20, 2010 at 5:32 PM, Steven D'Aprano wrote:
    XML is like violence: if it isn't working, try more.
    I love it -- may I quote you?
    I can't claim credit for it; I originally saw it in some sigs on
    Slashdot a few years ago. It certainly matches the majority of XML
    usage I've encountered.

    As to the original post: Yes, as others have suggested you're going to
    want an event-based parser along the lines of SAX. Sadly (for you)
    this means a mental shift in how you address your code, but it's not
    terrible - just different.

    --
    Brett Ritter / SwiftOne
    swiftone at swiftone.org
  • Stefan Behnel at Dec 21, 2010 at 8:44 am
    [note that this has also been posted to comp.lang.python and discussed
    separately over there]

    Steven D'Aprano, 20.12.2010 22:19:
    ashish makani wrote:
    Goal : I am trying to parse a ginormous ( ~ 1gb) xml file.
    I sympathize with you. I wonder who thought that building a 1GB XML file
    was a good thing.

    Forget about using any XML parser that reads the entire file into memory.
    By the time that 1GB of text is read and parsed, you will probably have
    something about 6-8GB (estimated) in size.
    The in-memory size is highly dependent on the data, specifically the
    text-to-structure ratio. If it's a lot of text content, the difference to
    the serialised tree will be small. If it's a lot of structure with tiny
    bits of text content, the in-memory size of the tree will be a lot larger.

    I am guessing, as this happens (over the course of 20-30 mins), the tree
    representing is being slowly built in memory, but even after 30-40 mins,
    nothing happens.
    It's probably not finished. Leave it another hour or so and you'll get an
    out of memory error.
    Right, if it gets into wild swapping, it can slow down almost to a halt,
    even though the XML parsing itself tends to have pretty good memory
    locality (but the ever growing in-memory tree obviously doesn't).

    4. I then investigated some streaming libraries, but am confused - there is
    SAX[http://en.wikipedia.org/wiki/Simple_API_for_XML] , the iterparse
    interface[http://effbot.org/zone/element-iterparse.htm], & several otehr
    options ( minidom)

    Which one is the best for my situation ?
    You absolutely need to use a streaming library. element-iterparse still
    builds the tree, so that's no use to you.
    Wrong. iterparse() allows you to cut branches in the tree while it's
    growing, that's exactly what it's there for.
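
    For reference, a minimal sketch of that pruning pattern, adapted from the effbot iterparse page linked in the original post ('huge.xml' and the tag name 'record' are placeholders, and the wanted elements are assumed to be direct children of the root):

    import xml.etree.ElementTree as etree

    # get an iterator and pull the root element off the first event
    context = iter(etree.iterparse('huge.xml', events=('start', 'end')))
    event, root = next(context)
    count = 0
    for event, elem in context:
        if event == 'end' and elem.tag == 'record':
            count += 1
            # write this subtree out as its own small XML file
            etree.ElementTree(elem).write('record_%d.xml' % count)
            # prune everything parsed so far, so the in-memory tree stays small
            root.clear()
    print(count, 'matching elements written out')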

    I believe you should use SAX or
    minidom, but that's about my limit of knowledge of streaming XML parsers.
    With "minidom" being an advice that's even worse than SAX - SAX would at
    least solve the problem, whereas minidom wouldn't because of its
    intolerable memory requirements.

    Stefan
  • David Hutto at Dec 21, 2010 at 8:49 am

    On Tue, Dec 21, 2010 at 3:44 AM, Stefan Behnel wrote:
    [note that this has also been posted to comp.lang.python and discussed
    separately over there]

    Steven D'Aprano, 20.12.2010 22:19:
    ashish makani wrote:
    Goal : I am trying to parse a ginormous ( ~ 1gb) xml file.
    I sympathize with you. I wonder who thought that building a 1GB XML file
    was a good thing.
    David Mertz, Ph.D., Comparator, Gnosis Software, Inc., June 2003:
    http://gnosis.cx/publish/programming/xml_matters_29.html


    That was just the first listing from:

    http://www.google.com/search?client=ubuntu&channel=fs&q=parsing+gigabyte+xml+python&ie=utf-8&oe=utf-8





    --
    They're installing the breathalyzer on my email account next week.
  • David Hutto at Dec 21, 2010 at 8:53 am
    But then again, maybe it's too much of an optimization for someone not
    optimizing for others or a specific application for the hardware, or
    it's not part of the standard python library, and therefore,
    expendable.
  • Stefan Behnel at Dec 21, 2010 at 8:57 am

    David Hutto, 21.12.2010 09:49:
    Steven D'Aprano, 20.12.2010 22:19:
    ashish makani wrote:
    Goal : I am trying to parse a ginormous ( ~ 1gb) xml file.
    I sympathize with you. I wonder who thought that building a 1GB XML file
    was a good thing.
    http://gnosis.cx/publish/programming/xml_matters_29.html
    Fredrik Lundh's cElementTree page has a benchmark for that, too. It's
    actually slower than cElementTree for the case he tested (which was
    basically "parsing" :)

    http://effbot.org/zone/celementtree.htm#benchmarks

    Stefan
  • David Hutto at Dec 21, 2010 at 9:12 am
    .
    I sympathize with you. I wonder who thought that building a 1GB XML file
    was a good thing.
    If it is:

    XML stands for eXtensible Markup Language.

    XML is designed to transport and store data.


    Then what other file medium would you suggest as the tagging means.

    You have a file with tags, you can't parse and store the data in any
    file anymore than the next, right?

    So the tags and how they are marked by any module or file extension
    searcher shouldn't matter, right?
    Fredrik Lundh's cElementTree page has a benchmark for that, too. It's
    actually slower than cElementTree for the case he tested (which was
    basically "parsing" :)

    http://effbot.org/zone/celementtree.htm#benchmarks

    Stefan



    --
    They're installing the breathalyzer on my email account next week.
  • Stefan Behnel at Dec 21, 2010 at 9:28 am
    Hi,

    I wonder why you reply to my e-mail without replying to what I wrote in it.


    David Hutto, 21.12.2010 10:12:
    .
    I sympathize with you. I wonder who thought that building a 1GB XML file
    was a good thing.
    This was written by Steven D'Aprano.

    If it is:

    XML stands for eXtensible Markup Language.

    XML is designed to transport and store data.


    Then what other file medium would you suggest as the tagging means.
    There are different file formats for structured and semi-structured data.
    XML certainly isn't the only one, and people have been defining specific
    formats for their specific use cases for ages, for better or worse each time.

    Personally, I don't think GB-sized XML files are bad per se. It depends on
    the use case, and it depends on what's considered a suitable solution in a
    given environment. Also note that XML tends to compress pretty well, and
    that it's sometimes faster to parse gzipped XML than uncompressed XML. So
    the serialised file size by itself isn't an argument, either.

    You have a file with tags, you can't parse and store the data in any
    file anymore than the next, right?

    So the tags and how they are marked by any module or file extension
    searcher shouldn't matter, right?
    I don't think I can extract the intended meaning from the assembled words
    you use here.

    Stefan
  • David Hutto at Dec 21, 2010 at 9:37 am

    On Tue, Dec 21, 2010 at 4:28 AM, Stefan Behnel wrote:
    Hi,

    I wonder why you reply to my e-mail without replying to what I wrote in it.


    David Hutto, 21.12.2010 10:12:
    .
    I sympathize with you. I wonder who thought that building a 1GB XML
    file
    was a good thing.
    This was written by Steven D'Aprano.
    My bad, human parsing has errors too.
    If it is:

    XML stands for eXtensible Markup Language.

    XML is designed to transport and store data.


    Then what other file medium would you suggest as the tagging means.
    There are different file formats for structured and semi-structured data.
    XML certainly isn't the only one, and people have been defining specific
    formats for their specific use cases for ages, for better or worse each
    time.
    But it's all a string of coded text with only the formats that define
    the markups within though.

    String format + text in file(type of coding for lang)


    Personally, I don't think GB-sized XML files are bad per-se. It depends on
    the use case, and it depends on what's considered a suitable solution in a
    given environment. Also note that XML tends to compress pretty well, and
    that it's sometimes faster to parse gzipped XML than uncompressed XML. So
    the serialised file size by itself isn't an argument, either.
    So the zipped file in compressed doesn't contain compressed tags, or
    data, then why is it compressed?
    You have a file with tags, you can't parse and store the data in any
    file anymore than the next, right?

    So the tags and how they are marked by any module or file extension
    searcher shouldn't matter, right?
    The phrase:
    <tag> in a php file
    <tag> in a xml file
    <tag> in an html file.

    if read in any file it's the same, as
    <tag>

    How does the file extension make it any longer?
    This is no matter how it's interpreted by any other mechanism than
    just reading the text within, right?
    I don't think I can extract the intended meaning from the assembled words
    you use here.

    Stefan



    --
    They're installing the breathalyzer on my email account next week.
  • Alan Gauld at Dec 21, 2010 at 10:03 am
    "David Hutto" <smokefloat at gmail.com> wrote
    XML stands for eXtensible Markup Language.
    XML is designed to transport and store data.

    Then what other file medium would you suggest as the tagging means.
    See my other post but there are many alternatives that are orders
    of magnitude more efficient. XML is one of the most inefficient
    data transport mechanisms ever invented and its main redeeming
    feature is its human readability.
    You have a file with tags, you can't parse and store the data in any
    file anymore than the next, right?
    Wrong, even CSV files are more efficient than parsing XML.
    (But are very limited in their data structure)

    But binary based formats like IDL and ASN.1 can be parsed
    very efficiently and, because they are binary based, store
    (and therefore transmit) their data much more efficiently too.

    HTH,

    --
    Alan Gauld
    Author of the Learn to Program web site
    http://www.alan-g.me.uk/
  • David Hutto at Dec 21, 2010 at 10:22 am
    Give me a little time to review this when it's not 5:30 in the morning
    and I've been up since 9 am yesterday, and 'relearning' c++:)

    But it still seems that you have coding + filetype +
    charactersinfileinformat., one long string that has to be parsed by
    the C functions.
  • Eike Welk at Dec 21, 2010 at 10:51 am

    On Tuesday 21.12.2010 10:12:55 David Hutto wrote:
    Then what other file medium would you suggest as the tagging means.
    One of those formats that is specially designed for large amounts of data
    is HDF5. It is intended for numerical data, but you can store text as well.
    There are multiple Python libraries for it; the most feature-rich is IMHO
    PyTables.

    http://www.pytables.org/moin
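
    For anyone curious, a tiny sketch of what storing and reading numerical data with PyTables looks like (the names follow the current PyTables API; older releases spell them openFile/createArray, and the file name is a placeholder):

    import numpy as np
    import tables

    # write a small array into an HDF5 file
    with tables.open_file('data.h5', mode='w') as h5:
        h5.create_array(h5.root, 'readings', np.arange(10), title='example data')

    # read it back
    with tables.open_file('data.h5', mode='r') as h5:
        print(h5.root.readings[:])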


    Eike.
  • Alan Gauld at Dec 21, 2010 at 9:58 am
    "David Hutto" <smokefloat at gmail.com> wrote
    I sympathize with you. I wonder who thought that building a 1GB XML
    file
    was a good thing.
    that was just the first listing:

    http://www.google.com/search?client=ubuntu&channel=fs&q=parsing+gigabyte+xml+python&ie=utf-8&oe=utf-8
    Eeek! One of the listings says:
    22 Jan 2009 ... Stripping Illegal Characters from XML in Python >>
    ... I'd be asking Python to process 6.4 gigabytes of CSV into
    6.5 gigabytes of XML 1. ..... In fact, what happened was that
    the parsing didn't work and the whole db was ...

    And I thought a 1G file was extreme... Do these people stop to think
    that with XML as much as 80% of their "data" is just description (i.e.
    the tags)?

    Alan G.
  • David Hutto at Dec 21, 2010 at 10:09 am

    On Tue, Dec 21, 2010 at 4:58 AM, Alan Gauld wrote:
    "David Hutto" <smokefloat at gmail.com> wrote
    I sympathize with you. I wonder who thought that building a 1GB XML file
    was a good thing.
    that was just the first listing:


    http://www.google.com/search?client=ubuntu&channel=fs&q=parsing+gigabyte+xml+python&ie=utf-8&oe=utf-8
    Eeek! One of the listings says:
    22 Jan 2009 ... Stripping Illegal Characters from XML in Python >>
    ... I'd be asking Python to process 6.4 gigabytes of CSV into
    6.5 gigabytes of XML 1. ..... In fact, what happened was that
    the parsing didn't work and the whole db was ...

    And I thought a 1G file was extreme... Do these people stop to think that
    with XML as much as 80% of their "data" is just description (ie the tags).

    That's what I was saying above: XML seems to be the hog in terms of
    its user-defined tags. Is that somewhat a confirmation of my hunch
    that it's the length of the user's predefined tags that adds to the
    above mess, and that maybe a reduced tag system in accordance with
    xml might be better, or a simple <a> tag / <b> tag in the xml (other
    files) with an index to point to a and b would be better?
  • Alan Gauld at Dec 21, 2010 at 10:45 am
    "David Hutto" <smokefloat at gmail.com> wrote
    That';s what I saying above that xml seems to be the hog in terms of
    it's user defined tags. Is that somewhat a confirmation of my hunch,
    that it's the length of the users predefined tags that add to the
    above mess, and that maybe a lessened tag system in accordance with
    xml might be better, or a simple <a> tag <b> tag in the xml(other
    files) with an index to point to a and b would be better.
    Shorter tags reduce the data volume by a bit (and it can be a
    big bit if the names are all 20 characters long!) but the inherent tag
    structure, even with single-char names, will still often surpass the
    data content.

    <i>
    5
    </i>

    8 bytes to describe an int which could be represented in
    a single byte in binary (or even in CSV). Even if the int were
    a 64bit binary value (8 bytes) the minimal tag structure still
    consumes the same data width. Of course if the data
    content is a long string then simple tags become cost
    effective (think <p> in XHTML)...
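
    A throwaway illustration of that overhead, counting the bytes of the serialised forms (the numbers apply to this toy value only):

    import struct

    xml_form = "<i>5</i>"             # minimal single-char tags: 8 bytes of text
    csv_form = "5,"                   # value plus a separator: 2 bytes
    bin_form = struct.pack("b", 5)    # one signed byte in binary: 1 byte
    print(len(xml_form), len(csv_form), len(bin_form))   # -> 8 2 1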

    HTH,


    --
    Alan Gauld
    Author of the Learn to Program web site
    http://www.alan-g.me.uk/
  • David Hutto at Dec 21, 2010 at 11:02 am

    On Tue, Dec 21, 2010 at 5:45 AM, Alan Gauld wrote:
    "David Hutto" <smokefloat at gmail.com> wrote
    That';s what I saying above that xml seems to be the hog in terms of
    it's user defined tags. Is that somewhat a confirmation of my hunch,
    that it's the length of the users predefined tags that add to the
    above mess, and that maybe a lessened tag system in accordance with
    xml might be better, or a simple <a> tag <b> tag in the xml(other
    files) with an index to point to a and b would be better.
    Shorter tags reduce the data volume by a bit (and it can be a
    big bit if the names are all 20 characters long!) but the inherent tag
    structure, even with single char names will still often surpass the
    data content.

    <i>
    5
    </i>
    8 bytes to describe an int which could be represented in
    a single byte in binary (or even in CSV).
    But that byte can't describe the tag (google, hold my hand). I'll get
    this eventually, but my iostream is long on content and hard on
    parsing. So many languages, and technology, yet so little time.

    Even if the int were
    a 64bit binary value (8 bytes) the minimal tag structure still
    consumes the same data width. Of course if the data
    content is a long string then simple tags become cost
    effective (think <p> in XHTML)...

    HTH,


    --
    Alan Gauld
    Author of the Learn to Program web site
    http://www.alan-g.me.uk/




    --
    They're installing the breathalyzer on my email account next week.
  • Stefan Behnel at Dec 21, 2010 at 11:19 am

    David Hutto, 21.12.2010 12:02:
    On Tue, Dec 21, 2010 at 5:45 AM, Alan Gauld wrote:
    8 bytes to describe an int which could be represented in
    a single byte in binary (or even in CSV).
    Well, "CSV" indicates that there's at least one separator character
    involved, so make that an asymptotic 2 bytes on average. But obviously,
    compression applies to CSV and other 'readable' formats as well.

    But that byte can't describe the tag
    Yep, that's an argument that Alan already presented.

    Stefan
  • David Hutto at Dec 21, 2010 at 11:41 am

    On Tue, Dec 21, 2010 at 6:19 AM, Stefan Behnel wrote:
    David Hutto, 21.12.2010 12:02:
    On Tue, Dec 21, 2010 at 5:45 AM, Alan Gauld wrote:

    8 bytes to describe an int which could be represented in
    a single byte in binary (or even in CSV).
    Well, "CSV" indicates that there's at least one separator character
    involved, so make that an asymptotic 2 bytes on average. But obviously,
    compression applies to CSV and other 'readable' formats as well.

    But that byte can't describe the tag
    Yep, that's an argument that Alan already presented.
    Didn't see that, but that would make the minimal format for parsing a
    comma, or any other single character marker, and the minimal would
    still be a specific marker in a file, but does not answer my question
    about the assignment to another file's variable.

    If file a.xml has simple tagged xml like <a>, and file b.config has
    tags that represent the a.xml(i.e.<a> = <antonym>) as greater tags,
    does this pattern optimize the process by limiting the size of the
    tags to be parsed in the xml, then converting those simpler tags that
    are found to the b.config values for the simple <a-z> simple format?
  • David Hutto at Dec 21, 2010 at 11:45 am

    In other words, I'm lazy and asking for the experiment to be performed
    for me (or, more importantly, whether it has already been done), but since
    I'm not new to this, if no one has a specific case, I'll timeit when I get to it.
  • Stefan Behnel at Dec 21, 2010 at 11:59 am

    David Hutto, 21.12.2010 12:45:
    If file a.xml has simple tagged xml like<a>, and file b.config has
    tags that represent the a.xml(i.e.<a> =<antonym>) as greater tags,
    does this pattern optimize the process by limiting the size of the
    tags to be parsed in the xml, then converting those simpler tags that
    are found to the b.config values for the simple<a-z> simple format?
    In other words I'm lazy and asking for the experiment to be performed
    for me(or, more importantly, if it has been), but since I'm not new to
    this, if no one has a specific case, I'll timeit when I get to it.
    I'm still not sure I understand what you are trying to describe here, but I
    think you want to look into the Wikipedia articles on indexing, hashing and
    compression.

    http://en.wikipedia.org/wiki/Index_%28database%29
    http://en.wikipedia.org/wiki/Index_%28information_technology%29
    http://en.wikipedia.org/wiki/Hash_function
    http://en.wikipedia.org/wiki/Data_compression

    Terms like "indirection" and "mapping" also come to my mind when I try to
    make sense out of your hints.

    Stefan
  • David Hutto at Dec 21, 2010 at 12:09 pm

    On Tue, Dec 21, 2010 at 6:59 AM, Stefan Behnel wrote:
    David Hutto, 21.12.2010 12:45:
    If file a.xml has simple tagged xml like <a>, and file b.config has
    tags that represent the a.xml (i.e. <a> = <antonym>) as greater tags,
    does this pattern optimize the process by limiting the size of the
    tags to be parsed in the xml, then converting those simpler tags that
    are found to the b.config values for the simple <a-z> simple format?
    I forget to insert my tags...

    <joke>
    In other words I'm lazy and asking for the experiment to be performed
    for me(or, more importantly, if it has been), but since I'm not new to
    this, if no one has a specific case, I'll timeit when I get to it.
    </joke>
    I'm still not sure I understand what you are trying to describe here, but I
    think you want to look into the Wikipedia articles on indexing, hashing and
    compression.
    a.xml has tags with simplistic forms, like was argued above, with <a>
    or <b>. b.config has variables for the simple tags in a.xml so that
    <a> = <alpha> in b.config.

    So when parsing a.xml, you parse it, then use the more complex tags
    defined in b.config. I'll review the URLs a little later.



    Terms like tags and xml also come to mind. Or parsing, or regular
    expressions, or re, or find, or a lot of things come to mind. My
    experience is limited, but not by much, and certainly not with respect
    to the scope of other languages. But thank you for the references; I'm
    not so good that I can't afford to look through a bunch of coal to
    find a diamond.

    Stefan



    --
    They're installing the breathalyzer on my email account next week.
  • Stefan Behnel at Dec 21, 2010 at 12:26 pm

    David Hutto, 21.12.2010 13:09:
    On Tue, Dec 21, 2010 at 6:59 AM, Stefan Behnel wrote:
    David Hutto, 21.12.2010 12:45:
    If file a.xml has simple tagged xml like<a>, and file b.config has
    tags that represent the a.xml(i.e.<a> =<antonym>) as greater tags,
    does this pattern optimize the process by limiting the size of the
    tags to be parsed in the xml, then converting those simpler tags that
    are found to the b.config values for the simple<a-z> simple format?
    In other words I'm lazy and asking for the experiment to be performed
    for me(or, more importantly, if it has been), but since I'm not new to
    this, if no one has a specific case, I'll timeit when I get to it.
    I'm still not sure I understand what you are trying to describe here
    a.xml has tags with simplistic forms, like was argued above, with<a>,
    or<b>. b.config has variables for the simple tags in a.xml so that
    <a> =<alpha> in b.config.

    So when parsing a.xml, you parse it, then use more complex tags to
    define with b.config.. I'll review the url's a little later.
    Ok, I'd call that simple renaming; that's what I meant by "indirection"
    and "mapping" (basically the two concepts that computer science is all
    about ;).

    Sure, run your own benchmarks, but don't expect anyone to be interested in
    the results. If your interest is to obfuscate the tag names, why not just
    use a binary (or less readable) format? That gives you much better
    obfuscation in the first place.

    Stefan
  • Stefan Behnel at Dec 21, 2010 at 10:19 am

    Alan Gauld, 21.12.2010 10:58:
    "David Hutto" wrote
    http://www.google.com/search?client=ubuntu&channel=fs&q=parsing+gigabyte+xml+python&ie=utf-8&oe=utf-8
    Eeek! One of the listings says:
    22 Jan 2009 ... Stripping Illegal Characters from XML in Python >>
    ... I'd be asking Python to process 6.4 gigabytes of CSV into
    6.5 gigabytes of XML 1. ..... In fact, what happened was that
    the parsing didn't work and the whole db was ...

    And I thought a 1G file was extreme... Do these people stop to think that
    with XML as much as 80% of their "data" is just description (ie the tags).
    As I already said, it compresses well. In run-length compressed XML files,
    the tags can easily take up a negligible amount of space compared to the
    more widely varying data content (although that also commonly tends to
    compress rather well). And depending on how fast your underlying storage
    is, decompressing and parsing the file may still be faster than parsing a
    huge uncompressed file directly. So, again, the sheer uncompressed file
    size is *not* a very interesting argument.

    Stefan
  • David Hutto at Dec 21, 2010 at 10:29 am

    On Tue, Dec 21, 2010 at 5:19 AM, Stefan Behnel wrote:
    Alan Gauld, 21.12.2010 10:58:
    "David Hutto" wrote

    http://www.google.com/search?client=ubuntu&channel=fs&q=parsing+gigabyte+xml+python&ie=utf-8&oe=utf-8
    Eeek! One of the listings says:
    22 Jan 2009 ... Stripping Illegal Characters from XML in Python >>
    ... I'd be asking Python to process 6.4 gigabytes of CSV into
    6.5 gigabytes of XML 1. ..... In fact, what happened was that
    the parsing didn't work and the whole db was ...

    And I thought a 1G file was extreme... Do these people stop to think that
    with XML as much as 80% of their "data" is just description (ie the tags).
    As I already said, it compresses well. In run-length compressed XML files,
    the tags can easily take up a negligible amount of space compared to the
    more widely varying data content (although that also commonly tends to
    compress rather well). And depending on how fast your underlying storage is,
    decompressing and parsing the file may still be faster than parsing a huge
    uncompressed file directly. So, again, the shear uncompressed file size is
    *not* a very interesting argument.
    However, could they (as mentioned elsewhere, and by others in another
    form) mitigate the damage by using smaller tags exclusively? And also,
    compressed is formatted, even for the tags, correct?
  • Stefan Behnel at Dec 21, 2010 at 10:49 am

    David Hutto, 21.12.2010 11:29:
    On Tue, Dec 21, 2010 at 5:19 AM, Stefan Behnel wrote:
    Alan Gauld, 21.12.2010 10:58:
    22 Jan 2009 ... Stripping Illegal Characters from XML in Python>>
    ... I'd be asking Python to process 6.4 gigabytes of CSV into
    6.5 gigabytes of XML 1. ..... In fact, what happened was that
    the parsing didn't work and the whole db was ...

    And I thought a 1G file was extreme... Do these people stop to think that
    with XML as much as 80% of their "data" is just description (ie the tags).
    As I already said, it compresses well. In run-length compressed XML files,
    the tags can easily take up a negligible amount of space compared to the
    more widely varying data content (although that also commonly tends to
    compress rather well). And depending on how fast your underlying storage is,
    decompressing and parsing the file may still be faster than parsing a huge
    uncompressed file directly. So, again, the shear uncompressed file size is
    *not* a very interesting argument.
    However, could they (as mentioned elsewhere, and by other in another
    form)mitigate the damage by using smaller tags exclusively?
    Why should that have a (noticeable) impact on the compressed file? It's the
    inherent nature of compression to reduce redundancy, which in XML files
    usually includes the redundancy of repeated tag names (even if the
    compression is not specifically XML aware).

    It's a very bad idea to use short and obfuscated tag names to reduce the
    storage size. That's like coding in assembler to reduce the size of the
    source code. Just use compression for storage, or buy a larger hard disk
    for your NAS.

    And also compressed is formatted, even for the tags, correct?
    The (lossless) compression doesn't change the content.

    Stefan
  • David Hutto at Dec 21, 2010 at 11:08 am

    On Tue, Dec 21, 2010 at 5:49 AM, Stefan Behnel wrote:
    David Hutto, 21.12.2010 11:29:
    On Tue, Dec 21, 2010 at 5:19 AM, Stefan Behnel wrote:

    Alan Gauld, 21.12.2010 10:58:
    22 Jan 2009 ... Stripping Illegal Characters from XML in Python>>
    ... I'd be asking Python to process 6.4 gigabytes of CSV into
    6.5 gigabytes of XML 1. ..... In fact, what happened was that
    the parsing didn't work and the whole db was ...

    And I thought a 1G file was extreme... Do these people stop to think
    that
    with XML as much as 80% of their "data" is just description (ie the
    tags).
    As I already said, it compresses well. In run-length compressed XML
    files,
    the tags can easily take up a negligible amount of space compared to the
    more widely varying data content (although that also commonly tends to
    compress rather well). And depending on how fast your underlying storage
    is,
    decompressing and parsing the file may still be faster than parsing a
    huge
    uncompressed file directly. So, again, the shear uncompressed file size
    is
    *not* a very interesting argument.
    However, could they (as mentioned elsewhere, and by other in another
    form)mitigate the damage by using smaller tags exclusively?
    Why should that have a (noticeable) impact on the compressed file? It's the
    inherent nature of compression to reduce redundancy, which in XML files
    usually includes the redundancy of repeated tag names (even if the
    compression is not specifically XML aware).

    It's a very bad idea to use short and obfuscated tag names to reduce the
    storage size.

    Maybe my style is a form of bad coder example in some areas (present
    company excepted). For example, I have a dictionary that has codes
    within a text file that point to other lines for verbs, adj, nouns,
    etc.
    So <a> doesn't have to mean a; it could mean <a> = <antonym>, but would
    that help in making the initial usage of <a> in the xml file faster,
    or slower, by parsing for <a> and then relating <a> to <antonym>?


    That's like coding in assembler to reduce the size of the
    source code.
    Haven't gotten to assembler yet, almost there.


    Just use compression for storage, or buy a larger hard disk for
    your NAS.

    And also compressed is formatted, even for the tags, correct?
    The (lossless) compression doesn't change the content.
    google search later, I promise.

    Stefan



    --
    They're installing the breathalyzer on my email account next week.
  • Alan Gauld at Dec 21, 2010 at 2:11 pm
    "Stefan Behnel" <stefan_ml at behnel.de> wrote
    And I thought a 1G file was extreme... Do these people stop to
    think that
    with XML as much as 80% of their "data" is just description (ie the
    tags).
    As I already said, it compresses well. In run-length compressed XML
    files, the tags can easily take up a negligible amount of space
    compared to the more widely varying data content
    I understand how compression helps with the data transmission aspect.
    compress rather well). And depending on how fast your underlying
    storage is, decompressing and parsing the file may still be faster
    than parsing a huge uncompressed file directly.
    But I don't understand how uncompressing a file before parsing it can
    be faster than parsing the original uncompressed file?

    There are ways of processing xml to reduce the tag space (a bit like
    tinyurl does with long urls) but then the parsing code has to know
    about the tag translations too - and usually the savings are small.

    Curious,

    Alan G.
  • Stefan Behnel at Dec 21, 2010 at 3:03 pm

    Alan Gauld, 21.12.2010 15:11:
    "Stefan Behnel" wrote
    And I thought a 1G file was extreme... Do these people stop to think that
    with XML as much as 80% of their "data" is just description (ie the tags).
    As I already said, it compresses well. In run-length compressed XML
    files, the tags can easily take up a negligible amount of space compared
    to the more widely varying data content
    I understand how compression helps with the data transmission aspect.
    compress rather well). And depending on how fast your underlying storage
    is, decompressing and parsing the file may still be faster than parsing a
    huge uncompressed file directly.
    But I don't understand how uncompressing a file before parsing it can
    be faster than parsing the original uncompressed file?
    I didn't say "uncompressing a file *before* parsing it". I meant
    uncompressing the data *while* parsing it. Just like you have to decode it
    for parsing, it's just an additional step to decompress it before decoding.
    Depending on the performance relation between I/O speed and decompression
    speed, it can be faster to load the compressed data and decompress it into
    the parser on the fly. lxml.etree (or rather libxml2) internally does that
    for you, for example, if it detects compressed input when parsing from a file.
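
    A small sketch of that on-the-fly decompression with the standard library: iterparse() accepts any file-like object, so a GzipFile can feed it decompressed bytes as it reads ('huge.xml.gz' and the tag name 'record' are placeholders; as noted above, lxml.etree handles compressed input for you when parsing from a file):

    import gzip
    import xml.etree.ElementTree as etree

    with gzip.open('huge.xml.gz', 'rb') as f:
        matches = 0
        for event, elem in etree.iterparse(f):
            if elem.tag == 'record':
                matches += 1
            elem.clear()    # keep memory flat; we only want the count here
    print(matches, 'matching elements in the compressed file')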

    Note that these performance differences are tricky to prove in benchmarks,
    as repeating the benchmark usually means that the file is already cached in
    memory after the first run, so the decompression overhead will dominate in
    the second run. That's not what you will see in a clean run or for huge
    files, though.

    Stefan
  • David Hutto at Dec 21, 2010 at 3:11 pm

    On Tue, Dec 21, 2010 at 10:03 AM, Stefan Behnel wrote:
    Alan Gauld, 21.12.2010 15:11:
    "Stefan Behnel" wrote
    And I thought a 1G file was extreme... Do these people stop to think
    that
    with XML as much as 80% of their "data" is just description (ie the
    tags).
    As I already said, it compresses well. In run-length compressed XML
    files, the tags can easily take up a negligible amount of space compared
    to the more widely varying data content
    I understand how compression helps with the data transmission aspect.
    compress rather well). And depending on how fast your underlying storage
    is, decompressing and parsing the file may still be faster than parsing a
    huge uncompressed file directly.
    But I don't understand how uncompressing a file before parsing it can
    be faster than parsing the original uncompressed file?
    I didn't say "uncompressing a file *before* parsing it".
    He didn't say utilizing code below Python either, but others will
    argue the microseconds matter, and if that's YOUR standard, then keep
    it for client and self.

    I meant
    uncompressing the data *while* parsing it. Just like you have to decode it
    for parsing, it's just an additional step to decompress it before decoding.
    Depending on the performance relation between I/O speed and decompression
    speed, it can be faster to load the compressed data and decompress it into
    the parser on the fly. lxml.etree (or rather libxml2) internally does that
    for you, for example, if it detects compressed input when parsing from a
    file.

    Note that these performance differences are tricky to prove in benchmarks,
    Tricky and proven? Then tell me what real-time systems (and this is in
    reference to a recent c++ discussion) python is used in, and how it could
    be utilized in... say, an aviation system to avoid a collision when
    milliseconds are on the line?
    as repeating the benchmark usually means that the file is already cached in
    memory after the first run, so the decompression overhead will dominate in
    the second run. That's not what you will see in a clean run or for huge
    files, though.

    Stefan



    --
    They're installing the breathalyzer on my email account next week.
  • Stefan Behnel at Dec 21, 2010 at 3:41 pm

    David Hutto, 21.12.2010 16:11:
    On Tue, Dec 21, 2010 at 10:03 AM, Stefan Behnel wrote:
    I meant
    uncompressing the data *while* parsing it. Just like you have to decode it
    for parsing, it's just an additional step to decompress it before decoding.
    Depending on the performance relation between I/O speed and decompression
    speed, it can be faster to load the compressed data and decompress it into
    the parser on the fly. lxml.etree (or rather libxml2) internally does that
    for you, for example, if it detects compressed input when parsing from a
    file.

    Note that these performance differences are tricky to prove in benchmarks,
    Tricky and proven, then tell me what real time, and this is in
    reference to a recent c++ discussion, is python used in ,andhow could
    it be utilized in....say an aviation system to avoid a collision when
    milliseconds are on the line?
    I doubt that there are many aviation systems that send around gigabytes of
    compressed XML data milliseconds before a collision.

    I even doubt that air plane collision detection is time critical anywhere
    in the milliseconds range. After all, there's a pilot who has to react to
    the collision warning, and he or she will certainly need more than a couple
    of milliseconds to react, not to mention the time that it takes for the air
    plane to adapt its flight direction. If you plan the system in a way that
    makes milliseconds count, you can just as well replace it by a
    jack-in-the-box. Oh, and that might even speed up the reaction of the pilot. ;)

    So, no, if these systems ever come close to a somewhat recent state of
    technology, I wouldn't mind if they were written in Python. The CPython
    runtime is pretty predictable in its performance characteristics, after all.

    Stefan
  • Alan Gauld at Dec 21, 2010 at 5:57 pm
    "Stefan Behnel" <stefan_ml at behnel.de> wrote
    But I don't understand how uncompressing a file before parsing it
    can
    be faster than parsing the original uncompressed file?
    I didn't say "uncompressing a file *before* parsing it". I meant
    uncompressing the data *while* parsing it.
    Ah, ok that can work, although it does add a layer of processing
    to identify compressed v uncompressed data, but if I/O is the
    bottleneck then it could give an advantage.

    Alan g.
  • Walter Prins at Dec 21, 2010 at 9:13 pm

    On 21 December 2010 17:57, Alan Gauld wrote:
    "Stefan Behnel" <stefan_ml at behnel.de> wrote

    But I don't understand how uncompressing a file before parsing it can
    be faster than parsing the original uncompressed file?
    I didn't say "uncompressing a file *before* parsing it". I meant
    uncompressing the data *while* parsing it.
    Ah, ok that can work, although it does add a layer of processing
    to identify compressed v uncompressed data, but if I/O is the
    bottleneck then it could give an advantage.
    OK, my apologies - I see my previous response was already superseded by
    later emails (which I had not read yet). Feel free to ignore it. :)

    Walter
    -------------- next part --------------
    An HTML attachment was scrubbed...
    URL: <http://mail.python.org/pipermail/tutor/attachments/20101221/225d58ba/attachment.html>
  • Stefan Behnel at Dec 22, 2010 at 8:19 am

    Walter Prins, 21.12.2010 22:13:
    On 21 December 2010 17:57, Alan Gauld wrote:
    "Stefan Behnel" wrote
    But I don't understand how uncompressing a file before parsing it can
    be faster than parsing the original uncompressed file?
    I didn't say "uncompressing a file *before* parsing it". I meant
    uncompressing the data *while* parsing it.
    Ah, ok that can work, although it does add a layer of processing
    to identify compressed v uncompressed data, but if I/O is the
    bottleneck then it could give an advantage.
    OK my apologies, I see my previous response was already circumscribed by
    later emails (which I had not read yet.) Feel free to ignore it. :)
    Not much of a reason to apologize. Especially on a newbie list like
    python-tutor, a few words more or a different way of describing things may
    help in widening the set of readers who understand and manage to follow
    other people's arguments.

    Stefan
  • Walter Prins at Dec 21, 2010 at 9:06 pm

    On 21 December 2010 14:11, Alan Gauld wrote:

    But I don't understand how uncompressing a file before parsing it can
    be faster than parsing the original uncompressed file?
    Because of I/O overhead/benefits. It's not so much that the parsing aspect
    of it is faster of course (it is what it is), it's that the total time taken
    to (read + decompress + parse) can be less than (read + parse), because the
    compressed file is much smaller on disk: the time saved reading it usually
    outweighs the extra CPU time spent decompressing the data into RAM.
    Generally speaking, compared to your CPU and memory, the disk is almost
    always the I/O culprit, though of course it does depend on exactly how much
    data we're talking about, how fast your CPU is, etc.
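
    For illustration, a minimal sketch of "decompressing while parsing" in
    Python (the file name, the .gz extension and the tag name 'record' are
    placeholders, not from this thread): the gzip file object is handed
    straight to iterparse, so the data is inflated in small chunks as the
    parser asks for it, and the disk only ever reads the small compressed file.

    import gzip
    import xml.etree.cElementTree as etree

    f = gzip.open('ginormous.xml.gz', 'rb')   # small compressed file on disk
    for event, elem in etree.iterparse(f):    # decompressed chunk by chunk
        if elem.tag == 'record':
            # handle the complete element here, then free its memory
            elem.clear()
    f.close()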

    In general computing this is less of an issue nowadays than perhaps a few
    years ago, and the gains can be as you say small, or sometimes not so small,
    depending exactly how much data you've got, how highly compressed it's
    become, how fast/efficient the decompresser is, how slow your I/O channel is
    etc, but the point nevertheless stands. Case in point, it's perhaps
    interesting to note that this technique is used regularly on the web in
    general -- many web servers serve their HTML as gzip (DEFLATE/LZ-based)
    compressed streams when the client advertises support for it, since (as
    above) it's quicker to compress, stream, decompress and parse than it is
    to just stream the data uncompressed. (And, of course, thanks to zlib +
    urllib one can even use this feature from Python should you wish to do so.)
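
    A minimal Python 2 era sketch of that zlib + urllib idea (the URL is a
    placeholder; a server is free to ignore the Accept-Encoding header and
    send the page uncompressed, in which case the decompress step is skipped):

    import urllib2
    import zlib

    req = urllib2.Request('http://www.example.com/',
                          headers={'Accept-Encoding': 'gzip'})
    resp = urllib2.urlopen(req)
    body = resp.read()

    if resp.info().get('Content-Encoding') == 'gzip':
        # 16 + MAX_WBITS tells zlib to expect a gzip header and trailer
        body = zlib.decompress(body, 16 + zlib.MAX_WBITS)

    print len(body), "bytes of HTML"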

    Anyway, just my $0.02!

    Walter
  • Alan Gauld at Dec 21, 2010 at 1:13 am
    "ashish makani" <ashish.makani at gmail.com> wrote
    I am looking for a specific element..there are several 10s/100s
    occurrences
    of that element in the 1gb file.

    I need to detect them & then for each 1, i need to copy all the
    content b/w
    the element's start & end tags & create a smaller xml
    This is exactly what sax and its kin are for. If you wanted to
    manipulate
    the xml data and recreate the original file, a tree-based approach is
    better, but for this
    kind of one-shot processing SAX will be much much faster.

    The concept is simple enough if you have ever used awk to process
    text files. (or the Python HTMLParser) You define a function that gets
    triggered when the parser detects a matching tag.
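
    To make that concrete, here is a minimal, hedged sketch of such a handler
    (the tag name 'record' and the output file names are invented, and it only
    collects character data; re-emitting nested markup would need the
    startElement/endElement callbacks to write out tags and attributes too):

    import xml.sax

    class RecordExtractor(xml.sax.ContentHandler):
        def __init__(self):
            xml.sax.ContentHandler.__init__(self)
            self.inside = False   # are we between <record> and </record> ?
            self.buf = []         # text collected for the current element
            self.count = 0

        def startElement(self, name, attrs):
            if name == 'record':
                self.inside = True
                self.buf = []

        def characters(self, content):
            if self.inside:
                self.buf.append(content)

        def endElement(self, name):
            if name == 'record':
                self.inside = False
                self.count += 1
                out = open('record_%d.xml' % self.count, 'w')
                out.write(''.join(self.buf).encode('utf-8'))
                out.close()

    xml.sax.parse('ginormous.xml', RecordExtractor())
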
    My hardware setup : I have a win7 pro box with 8gb of RAM & intel
    core2 quad
    cpuq9400.
    On this i am running sun virtualbox(3.2.12), with ubuntu
    10.10(maverick) as
    guest os, with 23gb disk space & 2gb(2048mb) ram, assigned to the
    guest
    ubuntu os.
    Obviously running the code in the virtual machine is limiting your
    ability to deal with the data but in this case you would be pushing
    hard to build the entire tree in RAM anyway so it probably doesn't
    matter.
    4. I then investigated some streaming libraries, but am confused -
    there is
    SAX[http://en.wikipedia.org/wiki/Simple_API_for_XML] ,
    Which one is the best for my situation ?
    I've only used sax - I tried minidom once but couldn't get it to work
    as I wanted so went back to sax... There are lots of examples of
    xml parsing using sax, both in Python and Java - just google.
    Should i instead just open the file, & use reg ex to look for the
    element i
    need ?
    Unless the xml is very simple you would probably find yourself
    creating a bigger problem. Regexes are not good at handling the
    kinds of recursive data structures that are found in SGML-based
    languages.

    HTH,


    --
    Alan Gauld
    Author of the Learn to Program web site
    http://www.alan-g.me.uk/
  • Ashish makani at Dec 21, 2010 at 2:37 am
    Thanks Luke, Steve, Brett, Lloyd & Alan
    for your prompt responses & sharing your wisdom.

    I <3 the python community... You(We ?) folks are AWESOME

    I cross-posted this query on comp.lang.python
    I bet most of you hang @ c.l.p too, but just in case, here is the link to
    the discussion at c.l.p
    https://groups.google.com/d/topic/comp.lang.python/i816mDMSoXM/discussion

    Thanks again for the amazing help & advice

    cheers
    ashish

    On Mon, Dec 20, 2010 at 5:13 PM, Alan Gauld wrote:

    "ashish makani" <ashish.makani at gmail.com> wrote

    I am looking for a specific element..there are several 10s/100s
    occurrences
    of that element in the 1gb file.

    I need to detect them & then for each 1, i need to copy all the content
    b/w
    the element's start & end tags & create a smaller xml
    This is exactly what sax and its kin are for. If you wanted to manipulate
    the xml data and recreate the original file tree based is better but for
    this
    kind of one shot processing SAX will be much much faster.

    The concept is simple enough if you have ever used awk to process
    text files. (or the Python HTMLParser) You define a function that gets
    triggered when the parser detects a matching tag.


    My hardware setup : I have a win7 pro box with 8gb of RAM & intel core2
    quad
    cpuq9400.
    On this i am running sun virtualbox(3.2.12), with ubuntu 10.10(maverick)
    as
    guest os, with 23gb disk space & 2gb(2048mb) ram, assigned to the guest
    ubuntu os.
    Obviously running the code in the virtual machine is limiting your
    ability to deal with the data but in this case you would be pushing
    hard to build the entire tree in RAM anyway so it probably doesn't
    matter.


    4. I then investigated some streaming libraries, but am confused - there
    is
    SAX[http://en.wikipedia.org/wiki/Simple_API_for_XML] ,
    Which one is the best for my situation ?
    I've only used sax - I tried minidom once but couldn't get it to work
    as I wanted so went back to sax... There are lots of examples of
    xml parsing using sax, both in Python and Java - just google.


    Should i instead just open the file, & use reg ex to look for the element
    i
    need ?
    Unless the xml is very simple you would probably find yourself
    creating a bigger problem. regex's are not good at handling the
    kinds of recursive data structures as can be found in SGML
    based languages.

    HTH,


    --
    Alan Gauld
    Author of the Learn to Program web site
    http://www.alan-g.me.uk/




    "We act as though comfort and luxury were the chief requirements of life,
    when all that we need to make us happy is something to be enthusiastic
    about."
    -- Albert Einstein
  • Chris Fuller at Dec 21, 2010 at 2:27 am
    This isn't XML, it's an abomination of XML. Best to not treat it as XML.
    Good thing you're only after one class of tags. Here's what I'd do. I'll
    give a general solution, but there are two parameters / four cases that could
    make the code simpler, I'll just point them out at the end.

    Iterate over the file descriptor, reading in line-by-line. This will be slow
    on a huge file, but probably not so bad if you're only doing it once. It makes
    the rest easier. Knuth has some sage advice on this point (*) :) Some
    feedback on progress to the user can be helpful here, if it is slow.

    Keep track of your offset into the file. There are two ways: use the tell()
    method of the file descriptor (but you will have to subtract the length of the
    current line), or just add up the line lengths as you process them.

    Scan each line for the open tag. Add the offset of the tag within the line
    to the offset of the current line within the file, and push that to a stack.
    Scan for the end tag; when you find one, pop an address from the stack, and
    put the two (start/end) addresses in a list for later. Keep doing this until
    you run out of file.

    Now, take that list, and pull off the address-pairs; seek() and read() them
    directly. Lather, rinse, repeat.

    Some off-the-cuff untested code:

    # start_tag and end_tag are the literal tag strings you are looking for,
    # e.g. start_tag = '<record>' and end_tag = '</record>'
    stk = []               # stack of start-tag offsets (handles nested tags)
    yummydataaddrs = []    # list of (start, end) offset pairs found so far

    fileoff = 0            # offset of the current line within the file

    fd = open('ginormous.xml', 'r')
    for line in fd:
        lineoff = line.find(start_tag)   # find() returns -1 if not present
        if lineoff != -1:
            stk.append(fileoff + lineoff)

        lineoff = line.find(end_tag)
        if lineoff != -1:
            # pair the most recent open tag with this end tag
            yummydataaddrs.append((stk.pop(-1), fileoff + lineoff))

        fileoff += len(line)

    for start, end in yummydataaddrs:
        fd.seek(start)
        # this grabs from the start tag up to the first character of the end
        # tag; extend 'end' by len(end_tag) if you want the closing tag too
        print "here's your stupid data:", fd.read(end - start + 1)


    You can simplify a bit if the tags are on a line by themselves, since you
    don't have to keep track of the offset within the line of the tag. The other
    simplification is if they aren't nested: you don't need to mess around with a
    stack in this case.


    (*) "Premature optimization is the root of all evil."


    Cheers
  • Ashish makani at Dec 21, 2010 at 4:11 am
    Chris

    This block of code made my day - especially yummydataaddrs & "here's your
    stupid data"
    for start,end in yummydataaddrs:
    fd.seek(start)
    print "here's your stupid data:", fd.read(end-start+1)

    Nothing is more impressive than solid code, with a good sense of humor.

    Thanks for the code; since i am in a time crunch, this approach
    might get me what i need more quickly.

    Thanks also for Knuth's awesome quote. It reminded me of my Stanford friend
    who told me that Prof. Knuth still holds a Christmas tree lecture every
    year... unfortunately, in spite of being in the bay area this year, i missed it
    :(
    http://stanford-online.stanford.edu/seminars/knuth/101206-knuth-500.asx

    Thanks a ton

    cheers
    ashish

    p.s. To everybody

    OT(off_topic): I moved to the bay area recently & am passionate about
    technology in general & linux, python, c, embedded, mobile, wireless
    stuff,.....
    I was wondering if any of you guys are part of some bay area python (or
    other tech) meetup (as in, do you guys meet up in person) for tech
    talks / discussions / brainstorming / hack nights?
    If yes, i would love to know more & be a part of it
    On Mon, Dec 20, 2010 at 9:27 PM, Chris Fuller wrote:


    This isn't XML, it's an abomination of XML. Best to not treat it as XML.
    Good thing you're only after one class of tags. Here's what I'd do. I'll
    give a general solution, but there are two parameters / four cases that
    could
    make the code simpler, I'll just point them out at the end.

    Iterate over the file descriptor, reading in line-by-line. This will be
    slow
    on a huge file, but probably not so bad if you're only doing it once. It
    makes
    the rest easier. Knuth has some sage advice on this point (*) :) Some
    feedback on progress to the user can be helpful here, if it is slow.

    Keep track of your offset into the file. There are two ways: use the
    tell()
    method of the file descriptor (but you will have to subtract the length of
    the
    current line), or just add up the line lengths as you process them.

    Scan each line for the open tag. Add the offset to the tag to the offset
    within
    the file of the current line, and push that to a stack. Scan for the end
    tag,
    when you find one, pop an address from the stack, and put the two
    (start/end)
    addresses a list for later. Keep doing this until you run out of file.

    Now, take that list, and pull off the address-pairs; seek() and read() them
    directly. Lather, rinse, repeat.

    Some off-the-cuff untested code:

    # start_tag and end_tag are the literal tag strings you are looking for,
    # e.g. start_tag = '<record>' and end_tag = '</record>'
    stk = []               # stack of start-tag offsets (handles nested tags)
    yummydataaddrs = []    # list of (start, end) offset pairs found so far

    fileoff = 0            # offset of the current line within the file

    fd = open('ginormous.xml', 'r')
    for line in fd:
        lineoff = line.find(start_tag)   # find() returns -1 if not present
        if lineoff != -1:
            stk.append(fileoff + lineoff)

        lineoff = line.find(end_tag)
        if lineoff != -1:
            # pair the most recent open tag with this end tag
            yummydataaddrs.append((stk.pop(-1), fileoff + lineoff))

        fileoff += len(line)

    for start, end in yummydataaddrs:
        fd.seek(start)
        # this grabs from the start tag up to the first character of the end
        # tag; extend 'end' by len(end_tag) if you want the closing tag too
        print "here's your stupid data:", fd.read(end - start + 1)


    You can simplify a bit if the tags are on a line by themselves, since you
    don't have to keep track of the offset within the line of the tag. The other
    simplification is if they aren't nested: you don't need to mess around
    with a
    stack in this case.


    (*) "Premature optimization is the root of all evil."


    Cheers


    *"We act as though comfort and luxury were the chief requirements of life,
    when all that we need to make us happy is something to be enthusiastic
    about."
    -- Albert Einstein*
  • Stefan Behnel at Dec 21, 2010 at 8:52 am

    Chris Fuller, 21.12.2010 03:27:
    This isn't XML, it's an abomination of XML. Best to not treat it as XML.
    Good thing you're only after one class of tags. Here's what I'd do. I'll
    give a general solution, but there are two parameters / four cases that could
    make the code simpler, I'll just point them out at the end.

    Iterate over the file descriptor, reading in line-by-line. This will be slow
    on a huge file, but probably not so bad if you're only doing it once.
    Note that it's not unlikely that this is actually *slower* than using a
    real XML parser:

    http://effbot.org/zone/celementtree.htm#benchmarks
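
    For reference, a minimal sketch of using that streaming interface (the tag
    name 'record' and the file name are placeholders; cElementTree is the
    Python 2 era module the benchmark page discusses). Clearing the root as we
    go keeps memory bounded even on a multi-gigabyte file:

    import xml.etree.cElementTree as etree

    context = iter(etree.iterparse('ginormous.xml', events=('start', 'end')))
    event, root = context.next()      # first event gives us the root element

    count = 0
    for event, elem in context:
        if event == 'end' and elem.tag == 'record':
            count += 1
            # elem is a fully parsed subtree here;
            # etree.tostring(elem) would give back its XML text
            root.clear()              # drop children already processed
    print count, "matching elements found"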

    Stefan
  • David Hutto at Dec 21, 2010 at 8:55 am

    On Tue, Dec 21, 2010 at 3:52 AM, Stefan Behnel wrote:
    Chris Fuller, 21.12.2010 03:27:
    This isn't XML, it's an abomination of XML. Best to not treat it as XML.
    Good thing you're only after one class of tags. Here's what I'd do. I'll
    give a general solution, but there are two parameters / four cases that
    could
    make the code simpler, I'll just point them out at the end.

    Iterate over the file descriptor, reading in line-by-line. This will be
    slow
    on a huge file, but probably not so bad if you're only doing it once.
    Note that it's not unlikely that this is actually *slower* than using a real
    XML parser:
    Or a 'real' language like C or C++ maybe to increase, or in Python's
    case, bypass, the interpreter?

  • David Hutto at Dec 21, 2010 at 8:56 am

    On Tue, Dec 21, 2010 at 3:55 AM, David Hutto wrote:
    On Tue, Dec 21, 2010 at 3:52 AM, Stefan Behnel wrote:
    Chris Fuller, 21.12.2010 03:27:
    This isn't XML, it's an abomination of XML. Best to not treat it as XML.
    Good thing you're only after one class of tags. Here's what I'd do. I'll
    give a general solution, but there are two parameters / four cases that
    could
    make the code simpler, I'll just point them out at the end.

    Iterate over the file descriptor, reading in line-by-line. This will be
    slow
    on a huge file, but probably not so bad if you're only doing it once.
    Note that it's not unlikely that this is actually *slower* than using a real
    XML parser:
    Or a 'real' language like C or C++ maybe to increase, or in Python's
    case, bypass, the interpreter?

    Which is *faster*.


    --
    They're installing the breathalyzer on my email account next week.
  • David Hutto at Dec 21, 2010 at 8:59 am
    And from what I recall, XML is intended for data transfer in respect to
    HTML (from a recent brushup, nothing more). Not having used it much, it
    sure has been presented as a data transfer mechanism; I remember this
    from using Joomla's framework, and the xml files for menus, I think.
  • David Hutto at Dec 21, 2010 at 9:06 am

    On Tue, Dec 21, 2010 at 3:59 AM, David Hutto wrote:
    And from what I recall XML is intended for data transfer in respect to
    HTML(from a recent brushup, nothing more),
    Apologies, that is browser-based transfer (not sure what more,
    although I think it can mean any data transfer)

    so not having used it, it
    sure has been displayed as a data transfer mechanism, I remember this
    from using Joomla's framework, and the xml files for menus I think.
  • Alan Gauld at Dec 21, 2010 at 9:46 am
    "David Hutto" <smokefloat at gmail.com> wrote
    And from what I recall XML is intended for data transfer in respect
    to
    HTML(from a recent brushup, nothing more),
    Apologies that is browser based transfer,
    I'm not sure what that last bit means.
    XML is a self-describing data format. It is usually used for files
    but can be used in data streams or in-memory strings.

    Its natural competitors are TLV (Tag, Length, Value) and
    CSV (Comma Separated Value) files, but neither is as rich
    in structure. Alternative options include ASN.1, Edifact and
    IDL, but these are not self-describing(*) (although they are all
    more compact and faster to parse, and only IDL is free.)
    sure has been displayed as a data transfer mechanism,
    You don't have to use it for data transfer - e.g. MS's use of it
    as a document storage format in Office - but frankly if
    you use XML to store large volumes of data you are mad;
    a database is a much more sensible option, being far more
    space efficient and faster to work with.

    (*)ASN.1, IDL etc all rely on a shared definition, and
    often shared code library, at both sender and receiver.
    The library is a compiled version of the data definition
    which enables complex data structures to be read from
    the file in a single chunk very efficiently.

    HTH,


    --
    Alan Gauld
    Author of the Learn to Program web site
    http://www.alan-g.me.uk/
  • David Hutto at Dec 21, 2010 at 10:00 am

    On Tue, Dec 21, 2010 at 4:46 AM, Alan Gauld wrote:
    "David Hutto" <smokefloat at gmail.com> wrote
    And from what I recall XML is intended for data transfer in respect to
    HTML(from a recent brushup, nothing more),
    Apologies that is browser based transfer,
    I'm not sure what that last bit means.
    XML is a self-describing data format. It is usually used for files
    but can be used in data streams or in-memory strings.
    I know it's self-tagged, meaning you create the tags within, and that
    it's used elsewhere as a form of data transfer. My previous usage of
    the particular file format was browser-based, but I know it's
    used in many other places, which is why I didn't see the point of
    the discussion saying it was horrible to use. I just asked for any
    alternative suggestions for files, since everyone 'seemed' to have a
    bad view of its usage, even though it seems to be the standard for
    user-defined tags for data transfer.


    Its natural competitors are TLV (Tag, Length, Value) and
    CSV (Comma Separated Value) files, but neither is as rich
    in structure.
    That was kind of my point, I've seen all but TLV in use, but XML is
    the web standard it seems.


    Alternative options include ASN.1, Edifact and
    IDL but these are not self-describing(*) (although they are all
    more compact and faster to parse, but only IDL is free
    Haven't heard of these, but the formula for a file, it seems to me,
    is encoding + extension + text; how much can these really differ?
    On average it seems that the self-defined tags of xml (someone has
    larger tag sizes, and more tags) would have a bigger impact on typical
    usage than a format with fixed, predefined tags.
    sure has been displayed as a data transfer mechanism,
    You don't have to use it for data transfer - eg MS's use
    as a document storage format in Office - but frankly if
    you use XML to store large volumes of data you are mad,
    a database is a much more sensible option being far more
    space efficient and faster to work with.
    If truly optimizing, I would time both, and maybe move to a different
    language, or pattern if it truly mattered.
    (*)ASN.1, IDL etc all rely on a shared definition, and
    often shared code library, at both sender and receiver.
    The library is a compiled version of the data definition
    which enables complex data structures to be read from
    the file in a single chunk very efficiently.
    This I might have to work on, but I rely on experience to quasi-trust
    experience.
  • Alan Gauld at Dec 21, 2010 at 10:30 am
    "David Hutto" <smokefloat at gmail.com> wrote
    (*)ASN.1, IDL etc all rely on a shared definition, and
    often shared code library, at both sender and receiver.
    This I might have to work on, but I rely on experience to
    quasi-trust
    experience.
    These are all data transport formats agreed and standardised
    long before XML appeared. IDL is the format used in COM calls
    for example and RPC calls between processes on an OS or
    across a network. It is an OpenGroup standard I believe.

    ASN.1 is a binary form and used in eCommerce and telecomms
    networks for many years. It is standardised by the ITU

    Edifact is the data standard of EDI and is set by the UN.
    It has been used for commercial trading between large corporates
    for many years.

    All of these standards developed when network bandwidth
    was very expensive so they all major on efficiency. XML was
    developed by non networks-oriented people for the ease of
    writing software for the web. Bandwidth was not a primary
    concern to them.

    There are other formats too, because the problem of transporting
    data portably between computers has been with us since the
    dawn of networking. XML just happens to be the most popular
    format today. But popularity doesn't necessarily mean it's good. :-)

    --
    Alan Gauld
    Author of the Learn to Program web site
    http://www.alan-g.me.uk/
