Hi,

I'm looking for a fast way of accessing some simple (structured) data.

The data is like this:
Approx 6-10 GB of simple XML files where the only elements
I really care about are the <title> and <article> ones.

So what I'm hoping to do is put this data in a format so
that I can access it as fast as possible for a given request
(http request, Python web server) that specifies just the title,
and I return the article content.

Is there some good format that is optimized for searching on
just one attribute (title) and then returning the corresponding article?

I've thought about putting this data in a SQLite database because,
from what I know, SQLite has very fast reads (no network latency, etc.)
but not as fast writes, which is fine because I probably won't be doing
much writing (I won't ever care about the speed of any writes).

So is a database the way to go, or is there some other,
more specialized format that would be better?

Thanks,
Alex


  • Diez B. Roggisch at Feb 2, 2008 at 8:51 pm

    agc schrieb:
    > So is a database the way to go, or is there some other,
    > more specialized format that would be better?

    Database it is. Make sure you have proper indexing.


    Diez
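
    (As a rough sketch of the database-plus-index approach Diez suggests --
    one SQLite table keyed by title, loaded once from the XML -- something
    like the following would do. The file names and the assumption that
    each record has a <title> element followed by an <article> element are
    illustrative only, not details given in the thread.)

        import sqlite3
        import xml.etree.cElementTree as ET

        conn = sqlite3.connect("articles.db")
        # Making title the primary key gives the index Diez is talking about.
        conn.execute("CREATE TABLE IF NOT EXISTS articles"
                     " (title TEXT PRIMARY KEY, body TEXT)")

        def load(xml_path):
            # iterparse streams the file, so the 6-10 GB of XML never has
            # to fit in memory at once.
            title = None
            for event, elem in ET.iterparse(xml_path):
                if elem.tag == "title":
                    title = elem.text
                elif elem.tag == "article":
                    conn.execute("INSERT OR REPLACE INTO articles VALUES (?, ?)",
                                 (title, elem.text))
                    elem.clear()  # release the parsed article body
            conn.commit()

        load("articles.xml")
        conn.close()
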
  • John Machin at Feb 2, 2008 at 9:50 pm

    agc wrote:
    > So is a database the way to go, or is there some other,
    > more specialized format that would be better?

    "Database" without any further qualification indicates exact matching,
    which doesn't seem to be very practical in the context of titles of
    articles. There is an enormous body of literature on inexact/fuzzy
    matching, and lots of deployed applications -- it's not a Python-related
    question, really.
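
    (As a toy illustration of the inexact/fuzzy matching John mentions --
    not something anyone in the thread proposes for 2 million titles --
    the standard library's difflib can rank near-matches of a title:)

        import difflib

        titles = ["Python (programming language)", "Monty Python", "Pythagoras"]
        # Returns the closest titles to the query string, best match first.
        print(difflib.get_close_matches("python programing language", titles, n=2))
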
  • Agc at Feb 3, 2008 at 4:42 am

    On Feb 2, 1:50 pm, John Machin wrote:
    > "Database" without any further qualification indicates exact matching,
    > which doesn't seem to be very practical in the context of titles of
    > articles. There is an enormous body of literature on inexact/fuzzy
    > matching, and lots of deployed applications -- it's not a Python-related
    > question, really.

    Yes, you are right that in some sense this question is not truly
    Python related, but I am looking to solve this problem in a way that
    plays as nicely as possible with Python:

    I guess an important feature of what I'm looking for is
    some kind of mapping from *exact* title to corresponding article,
    i.e. if my data set weren't so large, I would just keep all my
    data in an in-memory Python dictionary, which would be very fast.

    But I have about 2 million article titles mapping to approx. 6-10 GB
    of article bodies, so I think this would be just too big for a
    simple Python dictionary.

    Does anyone have any advice on the feasibility of using
    just an in-memory dictionary? The dataset just seems too big,
    but maybe there is a related method?

    Thanks,
    Alex
  • Stefan Behnel at Feb 3, 2008 at 5:41 pm

    agc wrote:
    > But I have about 2 million article titles mapping to approx. 6-10 GB
    > of article bodies, so I think this would be just too big for a
    > simple Python dictionary.

    Then use a database table that maps titles to articles, and make sure you
    create an index over the title column.

    Stefan
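
    (Request-time lookup against such a table is then a single indexed
    SELECT. A minimal sketch, reusing the hypothetical table and column
    names from the loading sketch earlier in the thread:)

        import sqlite3

        conn = sqlite3.connect("articles.db")

        def get_article(title):
            # Exact-match lookup; with title as the primary key (or an index
            # on the title column) SQLite answers this from the index rather
            # than scanning the table.
            row = conn.execute("SELECT body FROM articles WHERE title = ?",
                               (title,)).fetchone()
            return row[0] if row else None
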
  • M.-A. Lemburg at Feb 2, 2008 at 11:20 pm

    On 2008-02-02 21:36, agc wrote:
    > Is there some good format that is optimized for searching on
    > just one attribute (title) and then returning the corresponding article?

    Depends on what you want to search and how, e.g. whether
    a search for title substrings should give results, whether
    stemming is needed, etc.

    If all you want is a simple mapping of full title to article
    string, an on-disk dictionary is probably the way to go,
    e.g. mxBeeBase (part of the eGenix mx Base Distribution).
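
    (The on-disk dictionary idea can be illustrated with the standard
    library's shelve module; it stands in here only for mxBeeBase, whose
    actual API differs, and the file name and sample key are made up.)

        import shelve

        # Build the on-disk mapping once ...
        db = shelve.open("articles.shelf")
        db["Some title"] = "article body ..."
        db.close()

        # ... then open it read-only at serving time and use it like a dict.
        db = shelve.open("articles.shelf", flag="r")
        body = db.get("Some title")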

    For more complex search, you're better off with a tool that
    indexes the titles based on words, i.e. a full-text search
    engine such as Lucene.

    Databases can also handle this, but they often have problems when
    it comes to more complex queries where their indexes no longer
    help them to speed up the query and they have to resort to
    doing a table scan - a sequential search of all rows.

    Some databases provide special full-text extensions, but
    those are of varying quality. Better use a specialized
    tool such as Lucene for this.

    For more background on the problems of full-text search, see e.g.

    http://www.ibm.com/developerworks/opensource/library/l-pyind.html

    --
    Marc-Andre Lemburg
    eGenix.com

  • Ivan Illarionov at Feb 3, 2008 at 1:55 pm

    > Is there some good format that is optimized for searching on
    > just one attribute (title) and then returning the corresponding article?
    I would use Durus (http://www.mems-exchange.org/software/durus/) -
    a simple Pythonic object database - and store this data as a
    persistent Python dict with title keys and article values.
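
    (A rough sketch of that idea follows. The Durus module paths and calls
    are recalled from memory, not verified against a current release, so
    treat every name below as an assumption; the file name and sample title
    are also made up.)

        # Sketch only: check these imports and calls against the Durus docs.
        from durus.file_storage import FileStorage
        from durus.connection import Connection
        from durus.persistent_dict import PersistentDict

        connection = Connection(FileStorage("articles.durus"))
        root = connection.get_root()

        articles = root.get("articles")
        if articles is None:
            articles = root["articles"] = PersistentDict()

        articles["Some title"] = "article body ..."
        connection.commit()

        body = articles.get("Some title")

    (Whether a single persistent dict stays comfortable at around two
    million keys is worth checking; Durus also offers tree-based containers
    that avoid loading one huge mapping at once.)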
