FAQ
Hi,

what is the best approach to merge a database index with a lucene fulltext
index? Both databases store a unique ID per doc. This is the join criteria.

requirements:

* both resultsets may be very big (100.000 and much more)
* the merged resultset must be sorted by database index and/or relevance
* optional paging the merged resultset, a page has a size of 1000 docs max.

example:

select a, b from dbtable where c = 'foo' and content='bar' order by
relevance, a desc, d

I would split this into:

database: select ID, a, b from dbtable where c = 'foo' order by a desc, d
lucene: content:bar (sort:relevance)
merge: loop over the lucene resultset and add the db record into a new list
if the ID matches.

If the resultset must be paged:

database: select ID from dbtable where c = 'foo' order by a desc, d
lucene: content:bar (sort:relevance)
merge: loop over the lucene resultset and add the db record into a new list
if the ID matches.
page 1: select a,b from dbtable where ID IN (list of the ID's of page 1)
page 2: select a,b from dbtable where ID IN (list of the ID's of page 2)
...


Is there a better way?

Thank you.





---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org

Search Discussions

  • Chris Lu at Feb 28, 2009 at 8:04 pm
    I feel this may not be a good example. Since you can easily index field c,
    a, d and let Lucene to handle the filter "c = 'foo'" and the order by
    clause"order by a desc, d"

    --
    Chris Lu
    -------------------------
    Instant Scalable Full-Text Search On Any Database/Application
    site: http://www.dbsight.net
    demo: http://search.dbsight.com
    Lucene Database Search in 3 minutes:
    http://wiki.dbsight.com/index.php?title=Create_Lucene_Database_Search_in_3_minutes
    DBSight customer, a shopping comparison site, (anonymous per request) got
    2.6 Million Euro funding!
    On Sat, Feb 28, 2009 at 11:07 AM, wrote:

    Hi,

    what is the best approach to merge a database index with a lucene fulltext
    index? Both databases store a unique ID per doc. This is the join criteria.

    requirements:

    * both resultsets may be very big (100.000 and much more)
    * the merged resultset must be sorted by database index and/or relevance
    * optional paging the merged resultset, a page has a size of 1000 docs max.

    example:

    select a, b from dbtable where c = 'foo' and content='bar' order by
    relevance, a desc, d

    I would split this into:

    database: select ID, a, b from dbtable where c = 'foo' order by a desc, d
    lucene: content:bar (sort:relevance)
    merge: loop over the lucene resultset and add the db record into a new list
    if the ID matches.

    If the resultset must be paged:

    database: select ID from dbtable where c = 'foo' order by a desc, d
    lucene: content:bar (sort:relevance)
    merge: loop over the lucene resultset and add the db record into a new list
    if the ID matches.
    page 1: select a,b from dbtable where ID IN (list of the ID's of page 1)
    page 2: select a,b from dbtable where ID IN (list of the ID's of page 2)
    ...


    Is there a better way?

    Thank you.





    ---------------------------------------------------------------------
    To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
    For additional commands, e-mail: java-user-help@lucene.apache.org
  • Spring at Feb 28, 2009 at 8:59 pm
    I feel this may not be a good example.
    It was a very simple example.

    The real database query is very complex and joins serveral tables.
    It would be an absolute nightmare to copy all these tables into lucene and
    keep both in sync.



    ---------------------------------------------------------------------
    To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
    For additional commands, e-mail: java-user-help@lucene.apache.org
  • Erick Erickson at Feb 28, 2009 at 8:47 pm
    I'll second Chris's comment and ask whether you've considered
    denormalizing your data into Lucene and sticking exclusively
    with Lucene?

    Contrariwise, look for anything by Marcelo Ochoa on the user list
    about embedding Lucene in Oracle (which I confess I haven't looked
    into at all, but seems interesting).

    Best
    Erick
    On Sat, Feb 28, 2009 at 2:07 PM, wrote:

    Hi,

    what is the best approach to merge a database index with a lucene fulltext
    index? Both databases store a unique ID per doc. This is the join criteria.

    requirements:

    * both resultsets may be very big (100.000 and much more)
    * the merged resultset must be sorted by database index and/or relevance
    * optional paging the merged resultset, a page has a size of 1000 docs max.

    example:

    select a, b from dbtable where c = 'foo' and content='bar' order by
    relevance, a desc, d

    I would split this into:

    database: select ID, a, b from dbtable where c = 'foo' order by a desc, d
    lucene: content:bar (sort:relevance)
    merge: loop over the lucene resultset and add the db record into a new list
    if the ID matches.

    If the resultset must be paged:

    database: select ID from dbtable where c = 'foo' order by a desc, d
    lucene: content:bar (sort:relevance)
    merge: loop over the lucene resultset and add the db record into a new list
    if the ID matches.
    page 1: select a,b from dbtable where ID IN (list of the ID's of page 1)
    page 2: select a,b from dbtable where ID IN (list of the ID's of page 2)
    ...


    Is there a better way?

    Thank you.





    ---------------------------------------------------------------------
    To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
    For additional commands, e-mail: java-user-help@lucene.apache.org
  • Spring at Feb 28, 2009 at 9:01 pm

    Contrariwise, look for anything by Marcelo Ochoa on the user list
    about embedding Lucene in Oracle (which I confess I haven't looked
    into at all, but seems interesting).
    I know this lucene-oracle text cartridge.
    But my solution has to work with any of the big databases (MS, IBM, Oracle).


    ---------------------------------------------------------------------
    To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
    For additional commands, e-mail: java-user-help@lucene.apache.org
  • Chris Lu at Feb 28, 2009 at 9:09 pm
    Actually you can use DBSight(disclaimer:I work on it) to collect the data
    and keep them in sync.
    The free version has most the features and doesn't have size limit.

    --
    Chris Lu
    -------------------------
    Instant Scalable Full-Text Search On Any Database/Application
    site: http://www.dbsight.net
    demo: http://search.dbsight.com
    Lucene Database Search in 3 minutes:
    http://wiki.dbsight.com/index.php?title=Create_Lucene_Database_Search_in_3_minutes
    DBSight customer, a shopping comparison site, (anonymous per request) got
    2.6 Million Euro funding!

    On Sat, Feb 28, 2009 at 1:00 PM, wrote:

    Contrariwise, look for anything by Marcelo Ochoa on the user list
    about embedding Lucene in Oracle (which I confess I haven't looked
    into at all, but seems interesting).
    I know this lucene-oracle text cartridge.
    But my solution has to work with any of the big databases (MS, IBM,
    Oracle).


    ---------------------------------------------------------------------
    To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
    For additional commands, e-mail: java-user-help@lucene.apache.org
  • Spring at Feb 28, 2009 at 11:07 pm

    Actually you can use DBSight(disclaimer:I work on it) to
    collect the data
    and keep them in sync.
    Hm... it fulltext-indexes a database?
    It supports document content outside the database (custom crawler)?
    What query-syntax it supports?


    ---------------------------------------------------------------------
    To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
    For additional commands, e-mail: java-user-help@lucene.apache.org
  • Chris Lu at Mar 1, 2009 at 4:09 am
    Yes. DBSight helps to flatten database objects into Lucene's documents. It's
    more like Lucene-On-Rails. Custom crawler is supported via java api to crawl
    outside database. DBSight query syntax and Lucene query syntax are both
    supported, in addition to customizable analyzer, similarity, ranking, etc.

    I think you better try it first. It's faster to install it, select the
    content with your sql, and get the search up and running, than reading
    introduction materials.

    --
    Chris Lu
    -------------------------
    Instant Scalable Full-Text Search On Any Database/Application
    site: http://www.dbsight.net
    demo: http://search.dbsight.com
    Lucene Database Search in 3 minutes:
    http://wiki.dbsight.com/index.php?title=Create_Lucene_Database_Search_in_3_minutes
    DBSight customer, a shopping comparison site, (anonymous per request) got
    2.6 Million Euro funding!
    On Sat, Feb 28, 2009 at 1:33 PM, wrote:

    Actually you can use DBSight(disclaimer:I work on it) to
    collect the data
    and keep them in sync.
    Hm... it fulltext-indexes a database?
    It supports document content outside the database (custom crawler)?
    What query-syntax it supports?


    ---------------------------------------------------------------------
    To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
    For additional commands, e-mail: java-user-help@lucene.apache.org
  • Spring at Mar 1, 2009 at 7:20 am

    Yes. DBSight helps to flatten database objects into Lucene's
    documents.
    OK, thx for the advice.

    But back to my original question.

    When I have to merge both resultsets, what is the best approach to do this?


    ---------------------------------------------------------------------
    To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
    For additional commands, e-mail: java-user-help@lucene.apache.org
  • Erick Erickson at Mar 1, 2009 at 3:21 pm
    I think the message is don't even try unless you're explored the
    alternatives and found them inadequate.

    Best
    Erick
    On Sun, Mar 1, 2009 at 2:19 AM, wrote:

    Yes. DBSight helps to flatten database objects into Lucene's
    documents.
    OK, thx for the advice.

    But back to my original question.

    When I have to merge both resultsets, what is the best approach to do this?


    ---------------------------------------------------------------------
    To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
    For additional commands, e-mail: java-user-help@lucene.apache.org
  • Glen Newton at Mar 2, 2009 at 3:14 am
    I would suggest you try LuSql, which was designed specifically to
    index relational databases into Lucene.

    It has an extensive user manual/tutorial which has some complex
    examples involving multi-joins and sub-queries.

    I am the author of LuSql.
    LuSql home page:
    http://lab.cisti-icist.nrc-cnrc.gc.ca/cistilabswiki/index.php/LuSql
    LuSql manual: http://cuvier.cisti.nrc.ca/~gnewton/lusql/v0.9/lusqlManual.pdf.html

    thanks,

    Glen

    2009/2/28 <spring@gmx.eu>:
    Hi,

    what is the best approach to merge a database index with a lucene fulltext
    index? Both databases store a unique ID per doc. This is the join criteria.

    requirements:

    * both resultsets may be very big (100.000 and much more)
    * the merged resultset must be sorted by database index and/or relevance
    * optional paging the merged resultset, a page has a size of 1000 docs max.

    example:

    select a, b from dbtable where c = 'foo' and content='bar' order by
    relevance, a desc, d

    I would split this into:

    database: select ID, a, b from dbtable where c = 'foo' order by a desc, d
    lucene: content:bar (sort:relevance)
    merge: loop over the lucene resultset and add the db record into a new list
    if the ID matches.

    If the resultset must be paged:

    database: select ID from dbtable where c = 'foo' order by a desc, d
    lucene: content:bar (sort:relevance)
    merge: loop over the lucene resultset and add the db record into a new list
    if the ID matches.
    page 1: select a,b from dbtable where ID IN (list of the ID's of page 1)
    page 2: select a,b from dbtable where ID IN (list of the ID's of page 2)
    ...


    Is there a better way?

    Thank you.





    ---------------------------------------------------------------------
    To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
    For additional commands, e-mail: java-user-help@lucene.apache.org


    --

    -

    ---------------------------------------------------------------------
    To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
    For additional commands, e-mail: java-user-help@lucene.apache.org
  • Marcelo Ochoa at Mar 2, 2009 at 12:02 pm
    Hi:
    The point to catch with bad performance during merging a database
    result is to reduce the number of rows visited by your first query.
    As an example take a look a these two queries using Lucene Domain
    Index, the two are equivalents:
    Option A:

    select * from (select rownum as ntop_pos,q.* from (
    select extractValue(object_value,'/page/revision/timestamp'),extractValue(object_value,'/page/title')
    from pages where lcontains(object_value,
    'musica')>0
    and extractValue(object_value,'/page/revision/timestamp')
    between TO_TIMESTAMP_TZ('06-JAN-07 12.20.05.000000000 PM +00:00')
    and TO_TIMESTAMP_TZ('17-JUL-07 11.47.38.000000000 AM +00:00')
    order by extractValue(object_value,'/page/revision/timestamp')) q)
    where ntop_pos>=20 and ntop_pos<=30;

    Option B:

    select /*+ DOMAIN_INDEX_SORT */
    extractValue(object_value,'/page/revision/timestamp'),extractValue(object_value,'/page/title')
    from pages where lcontains(object_value,
    'rownum:[20 TO 30] AND musica AND revisionDate:[20070101 TO
    20070718]','revisionDate')>0;

    First query is using all traditional SQL syntax to do filtering,
    sorting and pagination (Oracle Top-N syntax), the second query is
    using filtering (revisionDate:[20070101 TO 20070718]), sorting
    (revisionDate) and pagination (rownum:[20 TO 30], Lucene Domain Index
    syntax) resolved inside the Lucene Domain Index.
    In execution time the two queries over a sub set (around 32000 pages)
    of WikiPedia Dumps uploaded into an Oracle 11g are 4 minutes for the
    first option and 55 millisecond for the second option.
    The big difference is how many rows the DB need to visits and then
    discard, for the first option my DB performs 2.900.671 buffer gets
    (block disk that are loaded into memory) and 21 for the second option.
    In second execution plan the optimizer receives the exact 10 rows to
    return by the Domain Index.
    So, no matter what the technology used, the more you can filter on the
    index, the faster will be the query.
    Obviously there will be queries when this rule is not true, for
    example if you have a bit map index on some column, querying the
    bitmap index first could be faster than a Domain Index scan, but the
    optimizer knows the true.
    Best regards, Marcelo.

    PD: If you need more information about how to use or how Lucene Domain
    Index works inside Oracle please take a look at:
    http://docs.google.com/Doc?id=ddgw7sjp_54fgj9kg
    On Sat, Feb 28, 2009 at 5:07 PM, wrote:
    Hi,

    what is the best approach to merge a database index with a lucene fulltext
    index? Both databases store a unique ID per doc. This is the join criteria.

    requirements:

    * both resultsets may be very big (100.000 and much more)
    * the merged resultset must be sorted by database index and/or relevance
    * optional paging the merged resultset, a page has a size of 1000 docs max.

    example:

    select a, b from dbtable where c = 'foo' and content='bar' order by
    relevance, a desc, d

    I would split this into:

    database: select ID, a, b from dbtable where c = 'foo' order by a desc, d
    lucene: content:bar (sort:relevance)
    merge: loop over the lucene resultset and add the db record into a new list
    if the ID matches.

    If the resultset must be paged:

    database: select ID from dbtable where c = 'foo' order by a desc, d
    lucene: content:bar (sort:relevance)
    merge: loop over the lucene resultset and add the db record into a new list
    if the ID matches.
    page 1: select a,b from dbtable where ID IN (list of the ID's of page 1)
    page 2: select a,b from dbtable where ID IN (list of the ID's of page 2)
    ...


    Is there a better way?

    Thank you.





    ---------------------------------------------------------------------
    To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
    For additional commands, e-mail: java-user-help@lucene.apache.org


    --
    Marcelo F. Ochoa
    http://marceloochoa.blogspot.com/
    http://marcelo.ochoa.googlepages.com/home
    ______________
    Want to integrate Lucene and Oracle?
    http://marceloochoa.blogspot.com/2007/09/running-lucene-inside-your-oracle-jvm.html
    Is Oracle 11g REST ready?
    http://marceloochoa.blogspot.com/2008/02/is-oracle-11g-rest-ready.html

    ---------------------------------------------------------------------
    To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
    For additional commands, e-mail: java-user-help@lucene.apache.org

Related Discussions

Discussion Navigation
viewthread | post
Discussion Overview
groupjava-user @
categorieslucene
postedFeb 28, '09 at 7:08p
activeMar 2, '09 at 12:02p
posts12
users5
websitelucene.apache.org

People

Translate

site design / logo © 2022 Grokbase