Hi all,

I'm thinking of using xapian to index my growing mail collection, which
I suppose is a reasonably common use-case. (The fact that gmane has
used xapian to index some tens of millions of emails is what piqued my
interest.)

First I want to start off with how very impressed I've been with Xapian.
The API is nice, feature-complete bindings are available for my
preferred language, it's very fast, and it generally Just Works exactly
how I expect. It's very good quality software.

One of the features I'd like to have is gmail-like tagging of threads.
This means I need to be able to add tags to, and remove tags from,
arbitrarily many documents after they have been indexed.

Unfortunately, xapian is quite slow at updating many documents.
Updating 10k documents by adding an XTAGfoo term takes 40 seconds
(including db flush) on my system. One use-case for updating this many
documents is, for example, applying a tag to existing emails in a
mailing list.

So, I am looking into an alternative approach. I'm already maintaining
a table of all threads (gmail would call them conversations) in an sql
database. All emails in the xapian database are given an XTID<n> term
corresponding to the thread id in the sql database. If I want to search
for all documents with the word "foo" and with tag "bar", I figured I'd
first look up all thread ids with tag "bar" in SQL, and then use those
ids as filters with the Xapian search. So the Xapian query might look
like:

(foo) (tid:0 tid:1 tid:2 tid:3)

Where tid: maps to the boolean term prefix XTID. This translates to:

Xapian::Query((Zfoo:(pos=1) AND 0 * (XTID0 OR XTID1 OR XTID2 OR
XTID3)))


This works fine until I start to add many XTID terms, and then I start
to notice that query time increases (about quadratically it seems) with
the number of terms. This surprised me, because a search for "foo" by
itself finishes in 0.0005 seconds with all matches fetched. I expected
the XTID terms to apply as a filter, and therefore be no worse than the
time it takes for the query to complete (with all matches fetched)
without the XTID terms.

Indeed, if the result set for query "foo" is sufficiently small (maybe a
few thousand documents), it's actually faster for me to iterate over
them all and manually check the XTID values myself.

Is there any way to improve Xapian's performance on this type of search?

I've also observed that if I put "OR" between the tid: terms, it
actually generates a different query that executes in less than half the
time of the version without "OR":

(foo) (tid:0 OR tid:1 OR tid:2 OR tid:3)

becomes:

Xapian::Query((Zfoo:(pos=1) AND (0 * XTID0 OR 0 * XTID1 OR 0 *
XTID2 OR 0 * XTID3)))

This was another surprising result, because I understood that "OR" was
implicit when multiple boolean terms were specified, so I didn't expect
a different query to be produced.

Lastly, if I remove the parens, yet another query is generated:

(foo) tid:0 tid:1 tid:2 tid:3

becomes:

Xapian::Query((Zfoo:(pos=1) FILTER (XTID0 OR XTID1 OR XTID2 OR
XTID3)))

Unfortunately this has about the same performance as the first form.

Even with the fastest form (tid:0 OR tid:1 OR tid:2 [...]) it's still a
bit disappointing. Hopefully someone can shed some insight as to what's
happening, and how I might improve the performance.

Thanks!
Jason.


  • James Aylett at Oct 5, 2009 at 6:27 pm

    On Sat, Oct 03, 2009 at 09:15:48PM -0400, Jason Tackaberry wrote:

    (foo) (tid:0 tid:1 tid:2 tid:3)
    What are the brackets for? The '*' in the output is, I think,
    OP_SCALE_WEIGHT, which doesn't seem a good query structure for what
    you're trying to do.

You actually want a FILTER query at top level, which you've figured
    out how to generate. I don't know why this isn't behaving as fast as
    you want (some tabulated figures on this, with information about the
    corpus size and so on might help others respond here).

    J

    --
    James Aylett

    talktorex.co.uk - xapian.org - uncertaintydivision.org
  • Olly Betts at Oct 5, 2009 at 8:50 pm

    On Mon, Oct 05, 2009 at 07:27:38PM +0100, James Aylett wrote:
    On Sat, Oct 03, 2009 at 09:15:48PM -0400, Jason Tackaberry wrote:

    (foo) (tid:0 tid:1 tid:2 tid:3)
    What are the brackets for? The '*' in the output is, I think,
    OP_SCALE_WEIGHT, which doesn't seem a good query structure for what
    you're trying to do.
    Yes, "*" means OP_SCALE_WEIGHT.
You actually want a FILTER query at top level, which you've figured
    out how to generate. I don't know why this isn't behaving as fast as
    you want (some tabulated figures on this, with information about the
    corpus size and so on might help others respond here).
    As of 1.0.4, any scale factor is pushed down to the leaf level by the
    query optimiser when it builds the postlist tree, and in 1.1.x it is
    pushed into the weighting scheme. So the different query
    representations aren't *necessarily* actually executed differently. In
    particular, OP_FILTER should be executed the same as OP_AND with
    OP_SCALE_WEIGHT 0 on one branch.

    But I've not had time to look at this in detail yet I'm afraid.

    Cheers,
    Olly
  • Olly Betts at Oct 6, 2009 at 10:00 am

    On Sat, Oct 03, 2009 at 09:15:48PM -0400, Jason Tackaberry wrote:
    Unfortunately, xapian is quite slow at updating many documents.
    Updating 10k documents by adding an XTAGfoo term takes 40 seconds
    (including db flush) on my system. One use-case for updating this many
    documents is, for example, applying a tag to existing emails in a
    mailing list.
    Currently we rewrite all the terms if any are changed, which is why it
    is a bit slow. This could be improved:

    http://trac.xapian.org/ticket/250

    So one approach would just be to live with the 40 seconds for now and
    you'll get a nice speed up when that optimisation is implemented.
    So, I am looking into an alternative approach. I'm already maintaining
    a table of all threads (gmail would call them conversations) in an sql
    database. All emails in the xapian database are given an XTID<n> term
    corresponding to the thread id in the sql database. If I want to search
    for all documents with the word "foo" and with tag "bar", I figured I'd
    first look up all thread ids with tag "bar" in SQL, and then use those
    ids as filters with the Xapian search. So the Xapian query might look
    like:

    (foo) (tid:0 tid:1 tid:2 tid:3)
    Note that you really shouldn't build up query strings mechanically to
    pass to the QueryParser. It doesn't have a rigidly defined syntax, but
is really a heuristic parser aiming to handle potentially poorly formed
human-written input (and those heuristics can change between releases),
so it may not parse your built query as you intended, or may stop
    doing so.

    Instead parse "foo" and then append the filters using the API:

    // Assuming the tids are listed in vector<string> tids:
    Xapian::Query filter(Xapian::Query::OP_OR, tids.begin(), tids.end());
    Xapian::Query q = qp->parse_query("foo");
    q = Xapian::Query(Xapian::Query::OP_FILTER, q, filter);
Where tid: maps to the boolean term prefix XTID. This translates to:

    Xapian::Query((Zfoo:(pos=1) AND 0 * (XTID0 OR XTID1 OR XTID2 OR
    XTID3)))

    This works fine until I start to add many XTID terms, and then I start
    to notice that query time increases (about quadratically it seems) with
    the number of terms. This surprised me, because a search for "foo" by
    itself finishes in 0.0005 seconds with all matches fetched. I expected
    the XTID terms to apply as a filter, and therefore be no worse than the
    time it takes for the query to complete (with all matches fetched)
    without the XTID terms.
    Well the filter part will take time to evaluate, but will potentially
    reduce the work required for the probabilistic part. So it's not
    totally clear whether it will be fast with or without the filter,
    but I/O is usually the limiting factor (having to read a single block
    from disk takes many many CPU cycles) so I'd probably guess the filtered
    version would be slower if there are a lot of "XTID" terms in it.
    Indeed, if the result set for query "foo" is sufficiently low (maybe a
    few thousand documents), it's actually faster for me to iterate over
    them all and manually check the XTID values myself.
    But this looks odd to me. When you say "query time increases", what
    are you actually timing here? If it includes the query parsing time
    then I suspect the quadratic behaviour is probably there.

    Pair-wise construction of a long query is still quadratic:

    http://trac.xapian.org/ticket/273

    We've worked around this in places in the QueryParser by putting
    subqueries in a container and building a query from that, but it's
    quite possible this can still happen in some cases. Generally
    humans don't enter queries with more than a dozen terms, so quadratic
behaviour usually doesn't kick in and so can easily escape notice.

    I'll try to produce a simple testcase later today.
    I've also observed that if I put "OR" between the tid: terms, it
actually generates a different query that executes in less than half the
time of the version without "OR":

    (foo) (tid:0 OR tid:1 OR tid:2 OR tid:3)

    becomes:

    Xapian::Query((Zfoo:(pos=1) AND (0 * XTID0 OR 0 * XTID1 OR 0 *
    XTID2 OR 0 * XTID3)))
    This is just a different representation with the same meaning as the
    query above and should result in exactly the same internal
    representation in the matcher and so have exactly the same performance.
    If it's much faster, that suggests that this is a parsing issue.
Lastly, if I remove the parens, yet another query is generated:

    (foo) tid:0 tid:1 tid:2 tid:3

    becomes:

    Xapian::Query((Zfoo:(pos=1) FILTER (XTID0 OR XTID1 OR XTID2 OR
XTID3)))
    This one should also result in the same internal representation in the
    matcher.

    Cheers,
    Olly
  • Jason Tackaberry at Oct 7, 2009 at 6:06 pm
    Hi Olly,
    On Tue, 2009-10-06 at 11:00 +0100, Olly Betts wrote:
    Currently we rewrite all the terms if any are changed, [...]
    So one approach would just be to live with the 40 seconds for now and
    you'll get a nice speed up when that optimisation is implemented.
    Is most of that rewriting work done at flush time? Because even
    excluding flush, currently updating 1000 documents takes about 3 seconds
    (on my system with a 200k document database). That's still quite a bit
    slower than I'd like. Do you anticipate any performance improvement
    with replace_document() as well, or just flush()?

    But this looks odd to me. When you say "query time increases", what
    are you actually timing here? If it includes the query parsing time
    then I suspect the quadratic behaviour is probably there.
    You're right on the money here. I was measuring parse time as well. I
    had the flawed intuition that parse time should be negligible compared
    to search time so ended up including it in my timing calculation.

    Indeed, when I construct the Query programmatically as you suggested,
    the performance win is significant. The whole approach looks quite
    feasible now, and your explanation makes good sense. Thank you.

    Cheers,
    Jason.
  • Olly Betts at Oct 8, 2009 at 12:07 am

    On Wed, Oct 07, 2009 at 02:06:49PM -0400, Jason Tackaberry wrote:
    On Tue, 2009-10-06 at 11:00 +0100, Olly Betts wrote:
    Currently we rewrite all the terms if any are changed,
    Is most of that rewriting work done at flush time? Because even
    excluding flush, currently updating 1000 documents takes about 3 seconds
    (on my system with a 200k document database). That's still quite a bit
    slower than I'd like. Do you anticipate any performance improvement
    with replace_document() as well, or just flush()?
    Unnecessary extra work happens both when replace_document() is called,
    and also when flush() is (or when it happens implicitly).
    But this looks odd to me. When you say "query time increases", what
    are you actually timing here? If it includes the query parsing time
    then I suspect the quadratic behaviour is probably there.
    You're right on the money here. I was measuring parse time as well. I
    had the flawed intuition that parse time should be negligible compared
    to search time so ended up including it in my timing calculation.
    (On the second attempt) I seem to have managed to write a testcase for
    this scaling problem. I'll investigate what's causing it.

    Cheers,
    Olly
  • Jason Tackaberry at Oct 8, 2009 at 2:30 pm

    On Thu, 2009-10-08 at 01:07 +0100, Olly Betts wrote:
    Unnecessary extra work happens both when replace_document() is called,
    and also when flush() is (or when it happens implicitly).
    Excellent. I noticed the target for issue #250 is 1.2.0. Is that still
    the current target, and if so, do you have any sense at this point
    roughly when this would be released?

    Also, do you yet have some intuition as to what kind of speed
    improvement should be expected?

    (On the second attempt) I seem to have managed to write a testcase for
    this scaling problem. I'll investigate what's causing it.
    For my purposes, constructing the XTID filter via the API rather than
    using the parser makes a lot more sense as well as being much faster,
    but I'm sure other people will benefit from your investigation. :)

    Thanks,
    Jason.
  • Olly Betts at Oct 8, 2009 at 4:29 pm

    On Thu, Oct 08, 2009 at 10:30:01AM -0400, Jason Tackaberry wrote:
    On Thu, 2009-10-08 at 01:07 +0100, Olly Betts wrote:
    Unnecessary extra work happens both when replace_document() is called,
    and also when flush() is (or when it happens implicitly).
    Excellent. I noticed the target for issue #250 is 1.2.0. Is that still
    the current target, and if so, do you have any sense at this point
    roughly when this would be released?
    The "1.2.0" milestone is currently a dumping ground for "stuff to
    probably be addressed in 1.2.x", so don't read too much into that.
    I've not looked at the pile recently, but I'd guess this is probably one
    of the ones likely to get done sooner.

    The 1.2.0 release was meant to be out in early September, but the last
    few issues dragged out and ran into my trip back to the UK to clear out
    and sell my house there, which has stalled progress for a while.
    Also, do you yet have some intuition as to what kind of speed
    improvement should be expected?
    I'd think pretty substantial. Work is broadly proportional to the
    number of terms being modified, though there may be some overhead
    in tracking what to change. But if you have a few hundred terms
per document and only change one, you could easily see a 100-fold
speed-up.

    Cheers,
    Olly

Discussion Overview
group: xapian-discuss
category: xapian
posted: Oct 4, '09 at 1:15a
active: Oct 8, '09 at 4:29p
posts: 8
users: 3
website: xapian.org
irc: #xapian
