Hi,

I think my question is a little bit off topic... sorry.

I'm working with a large file of network logs (25 MB, or 150000
lines). What I do is parse the file, identify the fields that interest
me most, put them in lists and then work with the lists to print out the
information I want.

This takes 5 minutes and eats a lot of RAM (150000 list elements is
quite a lot :-) so I thought about creating a database with all the fields
I want. For the database, I'm considering either a classical MySQL
database or an XML document used as a database and queried with XPath.
The XML option would allow me to run my program on any computer with
Python installed and eliminates the need for database installation and
management.

My questions are:
1- Would I gain in speed and RAM with the use of a database?
2- If so, would XML hold up to searching over 150000 entries?
3- How much speed would I lose with an XML document compared to the
MySQL database?

This is, of course, implemented in Python :-)
I'm very new to databases and XML.

Thanks!

Guille


  • Jorge Godoy at Sep 29, 2003 at 10:34 am

    Guillermo Fernandez writes:

    1- Would I gain in speed and RAM with the use of a database?
    2- If so, would XML hold up to searching over 150000 entries?
    3- How much speed would I lose with an XML document compared to
    the MySQL database?

    Depending on how you're going to traverse the XML tree, I think you'd
    end up using more RAM than you're using now.

    The RDBMS solution seems better to me (you can use SQLite for that
    through the pysqlite binding; it's a file-based database, very fast,
    and requires no maintenance work), but then we already use databases
    for other projects here. It is also faster, and it can simplify a lot
    of the work if that work is done on the server instead of on the
    client (again, we use PostgreSQL, and I don't know whether MySQL
    supports stored procedures and triggers).
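
As a rough illustration of the file-based database suggestion above, here is a minimal sketch using the `sqlite3` module, the standard-library descendant of pysqlite. The log format (`timestamp source destination bytes`), the table name `log`, and the helper names are assumptions made up for this example; the thread never shows the real log layout.

```python
import sqlite3

# Hypothetical log format: "<timestamp> <source_ip> <dest_ip> <bytes>"
SAMPLE_LINES = [
    "2003-09-29T08:16:00 10.0.0.1 10.0.0.9 512",
    "2003-09-29T08:16:01 10.0.0.2 10.0.0.9 128",
    "2003-09-29T08:16:02 10.0.0.1 10.0.0.7 256",
]


def parse_line(line):
    """Split one log line into (timestamp, src, dst, nbytes)."""
    timestamp, src, dst, nbytes = line.split()
    return timestamp, src, dst, int(nbytes)


def load_logs(conn, lines):
    """Create the table if needed and insert one row per log line."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS log "
        "(ts TEXT, src TEXT, dst TEXT, nbytes INTEGER)"
    )
    # executemany consumes the generator row by row, so the parsed
    # lines never have to sit in one big Python list.
    conn.executemany(
        "INSERT INTO log VALUES (?, ?, ?, ?)",
        (parse_line(line) for line in lines),
    )
    conn.commit()


def bytes_per_source(conn):
    """Let the database do the aggregation instead of Python lists."""
    cursor = conn.execute(
        "SELECT src, SUM(nbytes) FROM log GROUP BY src ORDER BY src"
    )
    return cursor.fetchall()


# ":memory:" keeps the sketch self-contained; a filename such as
# "logs.db" would store the data in a single file on disk instead.
conn = sqlite3.connect(":memory:")
load_logs(conn, SAMPLE_LINES)
totals = bytes_per_source(conn)
print(totals)
```

With a real log file, passing the open file object as `lines` would feed the rows in one at a time.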


    See you,
    --
    Godoy. <godoy@metalab.unc.edu>
  • Erik Price at Sep 29, 2003 at 2:46 pm

    On Monday, September 29, 2003, at 08:16AM, Guillermo Fernandez wrote:
    I'm working with a large file of network logs (25 MB, or 150000
    lines). What I do is parse the file, identify the fields that interest
    me most, put them in lists and then work with the lists to print out the
    information I want.

    This takes 5 minutes and eats a lot of RAM (150000 list elements is
    quite a lot :-) so I thought about creating a database with all the fields
    I want. For the database, I'm considering either a classical MySQL
    database or an XML document used as a database and queried with XPath.
    The XML option would allow me to run my program on any computer with
    Python installed and eliminates the need for database installation and
    management.

    My questions are:
    1- Would I gain in speed and RAM with the use of a database?

    Yes, but if you are using a relational database server such as PostgreSQL or MySQL, then you have the overhead of the database itself, which may occupy additional RAM and other resources. There are also usually some installation issues to work through, unless you already have a database installed and will simply be creating a new DB instance.

    2- If so, would XML hold up to searching over 150000 entries?

    XML parsing isn't super hard, but it's not easy unless you're working with a library that abstracts the details away behind a programmer-friendly API (for instance, the Jakarta Digester libraries, but that's Java and not Python). For one thing, if you build an in-memory tree of the logs using a DOM-based or other in-memory model, you will probably incur a similar or perhaps even greater overhead. However, if you implement a SAX listener that simply parses through the file and performs callbacks as it goes, you will use very little RAM. But this would really only be useful if your logs were already in an XML format, or if you were converting the logs from their original format to XML.

    In all, the XML-based solution doesn't require a database, but can be more work to implement in the end depending on how complex your needs are. And a DOM-based solution will not spare you much RAM since it uses the same model as your current approach.
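
To make the SAX idea above concrete, here is a minimal sketch with the standard-library `xml.sax` module. The XML layout (`<entry src=... nbytes=...>`) and the `ByteCounter` class are hypothetical; the logs discussed in this thread are plain text, so this only applies if they were converted to some XML format first.

```python
import xml.sax

# Hypothetical XML log layout, invented for this example.
SAMPLE_XML = b"""<logs>
  <entry src="10.0.0.1" nbytes="512"/>
  <entry src="10.0.0.2" nbytes="128"/>
  <entry src="10.0.0.1" nbytes="256"/>
</logs>"""


class ByteCounter(xml.sax.ContentHandler):
    """Accumulate bytes per source while the parser streams the input."""

    def __init__(self):
        super().__init__()
        self.totals = {}

    def startElement(self, name, attrs):
        # Called once per opening tag; no tree is ever built, so only
        # the running totals stay in memory.
        if name == "entry":
            src = attrs["src"]
            self.totals[src] = self.totals.get(src, 0) + int(attrs["nbytes"])


handler = ByteCounter()
xml.sax.parseString(SAMPLE_XML, handler)
print(handler.totals)
```

For a file on disk, `xml.sax.parse(path, handler)` would stream it the same way.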
    3- How much speed would I lose with an XML document compared to the
    MySQL database?

    You can't easily get numbers for this without doing your own benchmarking. It's too specific to your problem and hardware.
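
One way such a benchmark might be set up is sketched below. It times the same lookup against an in-memory XML tree (`xml.etree.ElementTree`) and an in-memory SQLite table. The data is synthetic, the names `N`, `TARGET`, and `timed` are invented for the example, and SQLite stands in for MySQL, so the timings say nothing about the real logs or a real server.

```python
import sqlite3
import time
import xml.etree.ElementTree as ET

N = 5000             # synthetic entry count, far smaller than the real file
TARGET = "10.0.0.7"  # the source address both searches look for

# Synthetic entries: ten source addresses used in rotation.
entries = [("10.0.0.%d" % (i % 10), i) for i in range(N)]

# Build the XML tree fully in memory (the DOM-style approach).
root = ET.Element("logs")
for src, nbytes in entries:
    ET.SubElement(root, "entry", src=src, nbytes=str(nbytes))

# Build the equivalent table, with an index on the searched column.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE log (src TEXT, nbytes INTEGER)")
conn.executemany("INSERT INTO log VALUES (?, ?)", entries)
conn.execute("CREATE INDEX log_src ON log (src)")


def timed(func):
    """Return (result, elapsed_seconds) for a single call of func."""
    start = time.perf_counter()
    result = func()
    return result, time.perf_counter() - start


def xml_count():
    # ElementTree's limited XPath support includes attribute predicates.
    return len(root.findall("entry[@src='%s']" % TARGET))


def sql_count():
    cursor = conn.execute("SELECT COUNT(*) FROM log WHERE src = ?", (TARGET,))
    return cursor.fetchone()[0]


xml_hits, xml_secs = timed(xml_count)
sql_hits, sql_secs = timed(sql_count)
print("XML:    %d hits in %.6f s" % (xml_hits, xml_secs))
print("SQLite: %d hits in %.6f s" % (sql_hits, sql_secs))
```

A single timed call is noisy; repeating each search several times (for example with the `timeit` module) and using the real data would give more trustworthy numbers.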

    Perhaps the best approach would be to write some kind of parser (not necessarily XML-based) to read through your log files in the same fashion as a SAX parser, performing callbacks as it goes, without actually reading everything into memory. If each entry is on a separate line, you can use the xreadlines method of the file object to do this.
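
A minimal sketch of the callback-style line parser described above. The field positions and the names `stream_log` and `TopTalkers` are assumptions for illustration. In current Python, iterating directly over the file object reads one line at a time; `xreadlines` was the Python 2 spelling and no longer exists in Python 3.

```python
import io


def stream_log(lines, on_entry):
    """Call on_entry(fields) once per non-empty line; return the line count.

    `lines` can be any iterable of strings, including an open file,
    so only one line is held in memory at a time.
    """
    count = 0
    for line in lines:
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        on_entry(fields)
        count += 1
    return count


class TopTalkers:
    """Callback object that keeps only a small running summary."""

    def __init__(self):
        self.bytes_by_src = {}

    def __call__(self, fields):
        # Hypothetical layout: timestamp, source, destination, byte count.
        src, nbytes = fields[1], int(fields[3])
        self.bytes_by_src[src] = self.bytes_by_src.get(src, 0) + nbytes


# io.StringIO stands in for open("network.log") in this sketch.
sample = io.StringIO(
    "2003-09-29T08:16:00 10.0.0.1 10.0.0.9 512\n"
    "\n"
    "2003-09-29T08:16:01 10.0.0.2 10.0.0.9 128\n"
    "2003-09-29T08:16:02 10.0.0.1 10.0.0.7 256\n"
)
talkers = TopTalkers()
processed = stream_log(sample, talkers)
print(processed, talkers.bytes_by_src)
```

This keeps memory flat only as long as the callback stores summaries rather than every parsed line.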



    Erik
  • Guillermo Fernandez at Sep 29, 2003 at 7:58 pm
    Hi,

    Thanks for the answers. Everything seems to indicate that XML does not
    solve my problems, as I was thinking of using a DOM-based model.

    Perhaps the best approach would be to write some kind of parser (not
    necessarily XML-based) to read through your log files in the same
    fashion as a SAX parser, performing callbacks as it goes, without
    actually reading everything into memory. If each entry is on a
    separate line, you can use the xreadlines method of the file object to
    do this.

    I have already programmed a parser that reads the file line by line with
    xreadlines, but the data processing I need requires some kind of data
    storage, either in a database or in memory as I do it now (which is one
    of the causes of my troubles ;-)

    The RDBMS solution (you can use sqlite for that, there's a pysqlite
    binding --- it's a file based database, very fast and requires no
    maintenance work) seems better to me (but we already use databases for
    other projects here)... It is also faster and can simplify a lot of
    the work if it's done on the server instead of on the client

    As I'm working only with the log files, I have no client-server issues
    to take into account :-)

    I had a look at the pysqlite module, and it seems to fit my needs
    better. The docs say sqlite "fully comply with the Python Database API
    v.2.0 specification" and give no further details about using pysqlite. I
    had a look at the Python library reference and there seems to be no
    "standard" database module. All those different database modules are
    quite confusing! Is there any tutorial on using databases with the
    Database API specification? Or one describing this specification?

    Thanks,

    Guille
  • Danny Yoo at Sep 30, 2003 at 1:04 pm

    I had a look at the pysqlite module, and it seems to fit my needs
    better. The docs say sqlite "fully comply with the Python Database API
    v.2.0 specification" and give no further details about using pysqlite. I
    had a look at the Python library reference and there seems to be no
    "standard" database module. All those different database modules are
    quite confusing! Is there any tutorial on using databases with the
    Database API specification? Or one describing this specification?

    Hi Guillermo,


    Yes, there's documentation on the Database API 2.0, starting from the
    'Database' topic guide:

    http://python.org/topics/database/

    We can find the API here:

    http://python.org/peps/pep-0249.html

    (It might be nice for the Sqlite folks to directly hyperlink the Database
    API link into their documentation. That should reduce the confusion for
    anyone else who's using Sqlite for the first time.)
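
For readers who want to see the shape of the Database API 2.0 before following the links, here is a minimal sketch of its usual call sequence: connect, get a cursor, execute, fetch, commit, close. It uses the standard-library `sqlite3` module, which follows that specification; the table and column names are invented for the example. Other DB-API modules follow the same pattern, but their `connect()` arguments and placeholder style (`paramstyle`) can differ.

```python
import sqlite3

# Every DB-API module advertises its placeholder style; sqlite3 uses "?".
print("paramstyle:", sqlite3.paramstyle)

# 1. Connect. The arguments are module-specific; sqlite3 takes a
#    filename, or ":memory:" for a throwaway in-memory database.
conn = sqlite3.connect(":memory:")
try:
    # 2. Get a cursor and run statements through it.
    cur = conn.cursor()
    cur.execute("CREATE TABLE log (src TEXT, nbytes INTEGER)")
    cur.executemany(
        "INSERT INTO log (src, nbytes) VALUES (?, ?)",
        [
            ("10.0.0.1", 512),
            ("10.0.0.2", 128),
            ("10.0.0.1", 256),
            ("10.0.0.3", 900),
        ],
    )

    # 3. Commit so the inserts become permanent.
    conn.commit()

    # 4. Query with bound parameters, then fetch the rows.
    cur.execute(
        "SELECT src, SUM(nbytes) AS total FROM log "
        "GROUP BY src ORDER BY total DESC LIMIT ?",
        (2,),
    )
    columns = [column[0] for column in cur.description]
    rows = cur.fetchall()
finally:
    # 5. Close the connection when done.
    conn.close()

print(columns)
print(rows)
```

Passing values as bound parameters instead of formatting them into the SQL string lets the module handle quoting.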



    Examples, examples... ok, Linux Journal has written an article that shows
    how to use the API:

    http://www.linuxjournal.com/article.php?sid&05

    There's another example from the Devshed folks:

    http://www.devshed.com/Server_Side/Python/PythonMySQL/

    We can also talk about examples on the Tutor list, if you'd like.



    If you have some really hefty database questions, there's a dedicated
    Special Interest Group for Python and databases:

    http://mail.python.org/mailman/listinfo/db-sig

Discussion Overview
group: tutor @
categories: python
posted: Sep 29, '03 at 8:16a
active: Sep 30, '03 at 1:04p
posts: 5
users: 4
website: python.org
