Hi

Say I have 2 databases, DB1 and DB2.

After I performed a search over these 2 DBs, I have 1 result and I want to
delete the resulting doc. How do I identify which database (DB1 / DB2) this
document resides in? And how do I get its docid, which is needed for the delete
(the delete must go through a single WritableDatabase, right?)

Thanks
Andrey

  • Olly Betts at Nov 22, 2007 at 12:11 am

    On Wed, Nov 21, 2007 at 01:44:11PM -0800, Andrey wrote:
    say I have 2 databases, DB1 and DB2

    after I performed a search over these 2 DBs, I have 1 result and I want to
    delete this resulting doc, how do I identify which database (DB1 / DB2) this
    document resides in? and how to get its docid which is needed during the delete
    process

    http://article.gmane.org/gmane.comp.search.xapian.general/1375
    (delete process must be over a single WritableDatabase, right?)
    Yes, WritableDatabase doesn't currently allow multiple subdatabases.

    Cheers,
    Olly
  • Andrey at Nov 22, 2007 at 12:43 am
    Very Nice, thanks

    did_raw = (did_merged - 1) / number_of_databases + 1

    offset = (did_merged - 1) % number_of_databases

    Cheers
    Andrey
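
A minimal Python sketch of the docid mapping above, assuming the interleaved numbering Xapian uses when databases are combined with add_database(), and counting sub-databases from 0 in the order they were added. The function name split_docid is made up for illustration; the 0-based database index is taken from (did_merged - 1) so that merged docid 1 lands in the first database.

```python
def split_docid(did_merged, number_of_databases):
    """Map a docid from a combined search back to (database index, local docid).

    Assumes interleaved numbering: with n sub-databases, merged docid 1 is
    local docid 1 of database 0, merged docid 2 is local docid 1 of
    database 1, and so on. Database indexes are 0-based.
    """
    if did_merged < 1 or number_of_databases < 1:
        raise ValueError("docids and database counts start at 1")
    db_index = (did_merged - 1) % number_of_databases
    # Integer division: in Python 3 this must be // rather than /.
    did_raw = (did_merged - 1) // number_of_databases + 1
    return db_index, did_raw


# Two databases: DB1 has index 0, DB2 has index 1.
for did in (1, 2, 3, 4):
    print(did, split_docid(did, 2))
```

For the two-database case this prints (0, 1), (1, 1), (0, 2), (1, 2) for merged docids 1 to 4. The delete itself would then go through a WritableDatabase opened on the selected sub-database, calling delete_document() with the local docid.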
  • Kevin Duraj at Dec 14, 2007 at 7:40 am
    Andrey,

    Did you measure the performance loss from searching two databases
    instead of one database?
    And if so, how much slower is it to search two databases compared to one database?


    _________________________________
    Kevin Duraj
    http://UncensoredWebSearch.com

  • Andrey at Dec 14, 2007 at 7:18 pm
    Kevin

    Unfortunately, I haven't had a chance to compare the two, since I already
    split the db up at the beginning, in my Xapian writer.

    The result for my 40M docs over 2 dbs (1 of which keeps flushing, every 30 mins)
    is still very good at below 1 sec... (my query is very complicated, with lots of
    (a b c d)OR(f g h i g)AND_MAYBE_AND 4*(x x )). I think there isn't much
    performance lost by breaking up the db.

    I will try to combine them, print out the results for the 2 scenarios,
    and post them here when I am able to.
    But I personally think the idea of breaking into multiple dbs has more gain
    than loss:

    1. Easier to handle / back up.
    2. In case 1 db is corrupted, you still have something to serve.
    3. Base db (non-flushing) vs updating db (will flush): the cache (warm-up) of
    the base db stays when flushing the 2nd db. (I am not sure about this, just a
    guess :P)

    From my own experience, breaking up into several dbs will not cause a big
    performance loss, like from 1 sec to 2 secs; it just works like querying 1 db
    once the cache is warm.
    Maybe you can try to make another copy of your db and search over the two
    together; it's very easy with just 1 extra line:
    db.add_database(xapian.Database("db"))

    Andrey
  • Olly Betts at Dec 18, 2007 at 11:49 am

    On Fri, Dec 14, 2007 at 11:18:12AM -0800, Andrey wrote:
    from my own experience, breaking up into dbs will not cause a big
    performance loss, like from 1sec to 2 secs, it just works like querying 1 db
    after cached up
    I would be surprised if there was a large overhead - there's a bit of
    extra work from opening the databases, and a small amount from having
    a "MultiPostList". The combined size of the split databases is usually
    a little larger than the combined one, which may increase VM pressure a
    bit.

    If you do profile and find there's a significant difference, it would
    be interesting to see comparable profiles for the two cases to see where
    the extra time is spent.
    maybe you can try to duplicate another copy of your db and search over them
    together, it's very easy with just 1 extra line
    db.add_database(xapian.Database("db"))
    You'd also need to generate the equivalent combined database (e.g. by
    using xapian-compact with the same input twice).

    But just duplicating the data isn't an accurate recreation of searching
    a real database split in two. I don't know if it actually would
    make a difference, but it might.

    Cheers,
    Olly
  • Kevin Duraj at Dec 19, 2007 at 10:32 pm

    On Dec 18, 2007 3:49 AM, Olly Betts wrote:
    On Fri, Dec 14, 2007 at 11:18:12AM -0800, Andrey wrote:
    from my own experience, breaking up into dbs will not cause a big
    performance loss, like from 1sec to 2 secs, it just works like querying 1 db
    after cached up
    We are all missing the point here. There are two types of Xapian users:

    1. Search engines with fewer than 1 million documents, or with data that
    fits in memory.
    2. Search engines with 1-100 million documents and data much larger
    than memory.

    When people test performance on data that easily fits into server
    memory, their data is cached in memory, so their performance
    measurements are high and distorted. We must measure performance
    when searches are not cached in memory but served from hard disk. Only
    then can we see the real performance of searches, as the hard disk
    spins and finds the correct data. Then the OS (Linux) places the result into
    cache if space is available. A second, identical search will use the cache
    instead of the hard disk, so its measured performance is too high and invalid.

    Users of all search engine platforms are surprised that some searches
    take very long, especially those that are not in cache, because they
    ran their performance tests on cache, not on hard disk. They quickly find
    their scalability problem and broken promises. In my case, with
    100-500GB of data on hard disk, the data cannot fit into memory, and using
    two databases is two times slower than using a single database. That is
    why I keep saying that the indexing performance of a single database is the
    most important, because the search performance follows.

    __________________________________
    Kevin Duraj
    http://UncensoredWebSearch.com
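
The cold-versus-cached point above can be checked with a small timing helper. This is only a sketch: search stands for any hypothetical callable that runs one query (for example a wrapper around Enquire.get_mset()), and the first call is only truly cold if the data is not already in the OS page cache, which this code cannot guarantee.

```python
import time


def cold_and_warm(search, query, warm_runs=3):
    """Time the first run of a query separately from immediate repeats.

    Returns (first_run_seconds, best_repeat_seconds). The first figure
    approximates a cold search only if the index pages were not cached
    before the call; the repeats are expected to be served from cache.
    """
    start = time.perf_counter()
    search(query)
    first = time.perf_counter() - start

    repeats = []
    for _ in range(warm_runs):
        start = time.perf_counter()
        search(query)
        repeats.append(time.perf_counter() - start)
    return first, min(repeats)
```

Running the same helper against the single-database and the split-database setup, with queries that have not been issued since the machine started, would give figures for the disk-bound case Kevin describes rather than for the cached one.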


  • James Aylett at Dec 20, 2007 at 2:12 pm

    On Wed, Dec 19, 2007 at 02:32:52PM -0800, Kevin Duraj wrote:

    In my case having 100-500GB data on hard disk, the data cannot fit
    into memory and using two databases is two times slower than using
    single database.
    Are you spindle-restricted here? Just a thought.

    I don't actually know how the matcher deals with multiple databases
    right now, but I suspect it does it in a sort of pseudo-parallel [1],
    in which case putting two databases behind the same re-seek bottleneck
    is going to utterly destroy performance in a way that wouldn't happen
    if you laid out your data differently onto the available
    platters. Figuring out the profile of this kind of thing is a pain,
    because you often have to write your own analysis tools :-/

    [1] I'm sure Olly or Richard can jump in here, but I'm assuming this
    because if you fill up the candidate mset from both databases
    concurrently then I think you're *probably* going to run for less
    time, because your minimum-weight to get into the candidate mset
    probably has more chance of drifting up faster (assuming the two
    databases are roughly equally relevant to your query). Lots of caveats
    there, and my assumption may be wrong anyway :-)

    J

    --
    /--------------------------------------------------------------------------\
    James Aylett xapian.org
    james@tartarus.org uncertaintydivision.org
  • Olly Betts at Jan 10, 2008 at 3:35 am

    On Thu, Dec 20, 2007 at 02:12:30PM +0000, James Aylett wrote:
    I don't actually know how the matcher deals with multiple databases
    right now, but I suspect it does it in a sort of pseudo-parallel [1],
    Actually, we process databases sequentially in this case. After the
    first database, we'll usually have an MSet full and so a decent minimum
    weight bound, so processing subsequent databases will usually be much
    quicker.

    This is likely to be more friendly if the databases are on the same
    disk(s), though it probably doesn't parallelise load so well if they
    aren't. But if your query load is high, concurrent queries will
    tend to do that for you anyway.

    I don't think we've tried processing databases in parallel, so it could
    be that would work better. It would be an interesting experiment if
    somebody wanted to try it.

    Cheers,
    Olly
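
The effect of the minimum weight bound can be illustrated with a toy model. This is not Xapian's matcher, just a sketch assuming each database is a plain list of (weight, docid) pairs and that the MSet is simply the k best hits; all names are invented for the example.

```python
import heapq


def sequential_top_k(databases, k):
    """Collect the k best hits, processing one database after another.

    Returns (best hits, number of candidates admitted per database).
    """
    top = []       # min-heap of (weight, db_index, docid); top[0] is the weakest kept hit
    admitted = []  # per database: how many candidates entered the heap
    for db_index, postings in enumerate(databases):
        count = 0
        for weight, docid in postings:
            if len(top) < k:
                heapq.heappush(top, (weight, db_index, docid))
                count += 1
            elif weight > top[0][0]:
                heapq.heapreplace(top, (weight, db_index, docid))
                count += 1
            # Otherwise the candidate is below the current bound and is dropped.
        admitted.append(count)
    return sorted(top, reverse=True), admitted


db1 = [(0.9, 1), (0.2, 2), (0.8, 3), (0.7, 4)]
db2 = [(0.1, 1), (0.85, 2), (0.3, 3), (0.05, 4)]
best, admitted = sequential_top_k([db1, db2], k=3)
print(best)      # weights 0.9, 0.85, 0.8
print(admitted)  # [4, 1]
```

Here the first database admits 4 candidates into the top 3, while the second admits only 1, because the bound is already 0.7 when it starts. A real matcher can use such a bound to skip work rather than just reject candidates, which this model does not attempt to show.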
  • James Aylett at Jan 10, 2008 at 8:24 pm

    On Thu, Jan 10, 2008 at 03:35:50AM +0000, Olly Betts wrote:

    I don't actually know how the matcher deals with multiple databases
    right now, but I suspect it does it in a sort of pseudo-parallel [1],
    Actually, we process databases sequentially in this case. After the
    first database, we'll usually have an MSet full and so a decent minimum
    weight bound, so processing subsequent databases will usually be much
    quicker.

    This is likely to be more friendly if the databases are on the same
    disk(s), though it probably doesn't parallelise load so well if they
    aren't. But if your query load is high, concurrent queries will
    tend to do that for you anyway.
    Both true. Hmm.
    I don't think we've tried processing databases in parallel, so it could
    be that would work better. It would be an interesting experiment if
    somebody wanted to try it.
    We'd need to devise a test case (better, several cases) with
    concurrent queries, using some sort of valid (or validatable)
    distribution of queries, against a database for which those queries
    are valid.

    Do you know (or can you look up) the proportion of GMane queries that
    are restricted to a specific group?

    J

    --
    /--------------------------------------------------------------------------\
    James Aylett xapian.org
    james@tartarus.org uncertaintydivision.org
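
One possible shape for such a concurrent test harness, sketched in Python. The search callable and the list of logged queries are hypothetical placeholders; a thread pool is used only to get overlapping requests, and whether that gives real parallelism depends on the bindings, so separate processes may be a better fit in practice.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def replay_concurrently(search, queries, workers=4):
    """Run search(query) for every logged query on a thread pool.

    Returns (total wall-clock seconds, list of per-query seconds) with the
    per-query timings in the same order as the input queries.
    """
    def timed(query):
        start = time.perf_counter()
        search(query)
        return time.perf_counter() - start

    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=workers) as pool:
        latencies = list(pool.map(timed, queries))
    return time.perf_counter() - start, latencies
```

Replaying the same log against a single database and against the split databases, at a few different worker counts, would show whether the comparison changes once queries overlap. It is safest to assume each worker needs its own database handle rather than sharing one across threads.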
  • Olly Betts at Jan 11, 2008 at 12:30 am

    On Thu, Jan 10, 2008 at 08:24:54PM +0000, James Aylett wrote:
    On Thu, Jan 10, 2008 at 03:35:50AM +0000, Olly Betts wrote:

    I don't think we've tried processing databases in parallel, so it could
    be that would work better. It would be an interesting experiment if
    somebody wanted to try it.
    We'd need to devise a test case (better, several cases) with
    concurrent queries, using some sort of valid (or validatable)
    distribution of queries, against a database for which those queries
    are valid.
    Tweakers.net have kindly supplied some sanitised query logs. They're
    predominantly Dutch, but could reasonably be run against an index of
    Dutch wikipedia data.

    Otherwise, anyone with a large live system split over several databases
    could run tests and report the results.
    Do you know (or can you look up) the proportion of GMane queries that
    are restricted to a specific group?
    I could, though I don't really have time for such data-mining at the
    moment. I'm not sure what you'd hope to learn from that though...

    Cheers,
    Olly
  • James Aylett at Jan 11, 2008 at 1:12 pm

    On Fri, Jan 11, 2008 at 12:30:45AM +0000, Olly Betts wrote:

    We'd need to devise a test case (better, several cases) with
    concurrent queries, using some sort of valid (or validatable)
    distribution of queries, against a database for which those queries
    are valid.
    Tweakers.net have kindly supplied some sanitised query logs. They're
    predominantly Dutch, but could reasonably be run against an index of
    Dutch wikipedia data.

    Otherwise, anyone with a large live system split over several databases
    could run tests and report the results.
    It'd be nice if we could have some realistic test runs available
    publicly somewhere, but I imagine most people will be unwilling to
    give them out. Otherwise I'm worried that if (for instance) I start
    playing around with profiling on Solaris, I'll end up optimising for
    my usage pattern. (I can ignore things that appear obviously biased,
    but at the end of the day any optimisation is going to be biased
    somehow.)
    Do you know (or can you look up) the proportion of GMane queries that
    are restricted to a specific group?
    I could, though I don't really have time for such data-mining at the
    moment. I'm not sure what you'd hope to learn from that though...
    There was something, I'm sure of it. Can't remember now, though :(

    J

    --
    /--------------------------------------------------------------------------\
    James Aylett xapian.org
    james@tartarus.org uncertaintydivision.org
