FAQ
Hi all,

We are currently evaluating potential search frameworks (such as Hibernate Search) which might be suitable to use in our project (using Spring, JPA with Hibernate) ...
I am sending this E-Mail in hope you can advise me on a few issues that would help us in our decision making process.


1.) Is Lucene suitable for full text database searches? I read Lucene was designed to index and search documents but how does it behave querying relational data sets in general?

2.) Can we make assumptions on query performance considering combined searches, range queries or structured data and wildcard searches? If we consider a data structure consisting of say 3 tables and each table contains a few million entries (e.g. first name, last name and address fields) and we search for common values (such as 'John', 'Smith' and 'New York') where

a. each value for itself and each combination would result in millions of hits

b. a person can have multiple first names and we want to make sure to receive any combination of the last name with any first name

c. we search for a last name and a range of birth dates

3.) Transaction safety: How does Lucene handle indexes? If we update data model and index, what happens to the index if anything goes wrong as soon as the data model has been persisted?

I hope I made the issues clear to you, just some general thoughts about how Lucene would behave in a real world application scenario ... Any support or pointers to helpful documents or Web links are highly appreciated!
Cheers for now,

w

Search Discussions

  • Chris Lu at Aug 25, 2010 at 5:28 pm
    I see you are coming from the database world. To get a better understanding
    of Lucene, I would suggest you use the free version of DBSight, which let
    you create Lucene index with SQL after a few clicks.

    Basically Lucene is more like a list of denormalized documents. So if you
    change your database schema, your lucene index would likely need to be
    changed and re-built also.

    The combined full text search and range queries are easy in Lucene index.
    You don't need to worry about it.

    It's a great question about 3). If anything goes wrong, you would need to
    rebuild the index. So you would need a mechanism to get prepared and rebuild
    the index when you need to.

    --
    Chris Lu
    -------------------------
    Instant Scalable Full-Text Search On Any Database/Application
    site: http://www.dbsight.net
    demo: http://search.dbsight.com
    Lucene Database Search in 3 minutes:
    http://wiki.dbsight.com/index.php?title=Create_Lucene_Database_Search_in_3_minutes
    DBSight customer, a shopping comparison site, (anonymous per request) got
    2.6 Million Euro funding!
    On Wed, Aug 25, 2010 at 6:20 AM, Schreiner Wolfgang wrote:

    Hi all,

    We are currently evaluating potential search frameworks (such as Hibernate
    Search) which might be suitable to use in our project (using Spring, JPA
    with Hibernate) ...
    I am sending this E-Mail in hope you can advise me on a few issues that
    would help us in our decision making process.


    1.) Is Lucene suitable for full text database searches? I read Lucene
    was designed to index and search documents but how does it behave querying
    relational data sets in general?

    2.) Can we make assumptions on query performance considering combined
    searches, range queries or structured data and wildcard searches? If we
    consider a data structure consisting of say 3 tables and each table contains
    a few million entries (e.g. first name, last name and address fields) and we
    search for common values (such as 'John', 'Smith' and 'New York') where

    a. each value for itself and each combination would result in
    millions of hits

    b. a person can have multiple first names and we want to make sure to
    receive any combination of the last name with any first name

    c. we search for a last name and a range of birth dates

    3.) Transaction safety: How does Lucene handle indexes? If we update
    data model and index, what happens to the index if anything goes wrong as
    soon as the data model has been persisted?

    I hope I made the issues clear to you, just some general thoughts about how
    Lucene would behave in a real world application scenario ... Any support or
    pointers to helpful documents or Web links are highly appreciated!
    Cheers for now,

    w
  • Erick Erickson at Aug 25, 2010 at 5:34 pm
    The SOLR wiki has lots of good information, start there:
    http://wiki.apache.org/solr/

    Otherwise, see below...
    On Wed, Aug 25, 2010 at 6:20 AM, Schreiner Wolfgang wrote:

    Hi all,

    We are currently evaluating potential search frameworks (such as Hibernate
    Search) which might be suitable to use in our project (using Spring, JPA
    with Hibernate) ...
    I am sending this E-Mail in hope you can advise me on a few issues that
    would help us in our decision making process.


    1.) Is Lucene suitable for full text database searches? I read Lucene
    was designed to index and search documents but how does it behave querying
    relational data sets in general?
    Let's start be talking about the phrase "full text database searches". One
    thing virtually all db-centric
    people trip over is trying to use SOLR as if it were a database. You just
    can't think about tables. The
    first time you think about using SOLR to do something join-like, stop and
    take a deep breath and
    think about documents instead. The general approach is to flatten your data
    so that each "document"
    contains all the relevant info. Yes, this leads to de-normalization. Yes,
    denormalized data makes a
    good DBA cringe. But that's the difference between searching and using a
    RDBMS.

    "Document" is somewhat misleading. A document in SOLR terms is just a
    collection of fields. And, BTW,
    there's no requirement that each document have the same fields (very unlike
    a DB).

    2.) Can we make assumptions on query performance considering combined
    searches, range queries or structured data and wildcard searches? If we
    consider a data structure consisting of say 3 tables and each table contains
    a few million entries (e.g. first name, last name and address fields) and we
    search for common values (such as 'John', 'Smith' and 'New York') where

    a. each value for itself and each combination would result in
    millions of hits
    Sure, but what those assumptions are is totally dependent on how you've set
    things up. SOLR has been successfully
    used on several billion document indexes. There are tools for making all
    that work (i.e. replication, sharding, etc)
    built into SOLR. So I suspect you can make things work. Several million
    documents is not that large a data set.

    As always, there are tradeoffs between speed and complexity. But from what
    you've described
    I see no show stoppers.

    b. a person can have multiple first names and we want to make sure to
    receive any combination of the last name with any first name
    This just sounds like an OR. But the queries can be pretty complex queries.
    Some examples of what you expect would help.
    See multi-valued fields. So, a "document" can have multiple "firstname"
    entries. Again, not like a DB (your reflexes will trip you
    up on this point <G>).

    c. we search for a last name and a range of birth dates
    Sure, range queries work just fine. Note that dates can trip you up, look at
    triedate if you experiment.

    3.) Transaction safety: How does Lucene handle indexes? If we update
    data model and index, what happens to the index if anything goes wrong as
    soon as the data model has been persisted?
    A lot of work has been done to make SOLR quite robust if "anything goes
    wrong". That said, how are you backing up your data?
    That is, what is the source of the data you're going to index? If you're
    relying on your SOLR index to be your backup, you simply must back it up
    somewhere "often enough" to get by if your building burns down. I'd also
    think about storing your original input...

    This is no different than a DB. you have to guard against the disk crashing,
    someone walking by with a powerful magnet, earthquake, flood, fires
    <G>.....

    Do note that if you modify your index schema, no existing documents reflect
    the new schema, you have to reindex them.

    I hope I made the issues clear to you, just some general thoughts about how
    Lucene would behave in a real world application scenario ... Any support or
    pointers to helpful documents or Web links are highly appreciated!
    Cheers for now,

    w
  • Lance Norskog at Aug 26, 2010 at 3:25 am
    A stepping stone to the above is that, in DB terms, a Lucene index is
    only one table. It has a suite of indexing features that are very
    different from database search. The features are oriented to searching
    large bodies of text for "ideas" rather than concrete words. It
    searches a lot faster than a DB. It also spends more time creating its
    various indexes than a DB. Other points- you can't add or drop fields
    or indexes.

    On Wed, Aug 25, 2010 at 10:33 AM, Erick Erickson
    wrote:
    The SOLR wiki has lots of good information, start there:
    http://wiki.apache.org/solr/

    Otherwise, see below...

    On Wed, Aug 25, 2010 at 6:20 AM, Schreiner Wolfgang <
    Wolfgang.Schreiner@itsv.at> wrote:
    Hi all,

    We are currently evaluating potential search frameworks (such as Hibernate
    Search) which might be suitable to use in our project (using Spring, JPA
    with Hibernate) ...
    I am sending this E-Mail in hope you can advise me on a few issues that
    would help us in our decision making process.


    1.)    Is Lucene suitable for full text database searches? I read Lucene
    was designed to index and search documents but how does it behave querying
    relational data sets in general?
    Let's start be talking about the phrase "full text database searches". One
    thing virtually all db-centric
    people trip over is trying to use SOLR as if it were a database. You just
    can't think about tables. The
    first time you think about using SOLR to do something join-like, stop and
    take a deep breath and
    think about documents instead. The general approach is to flatten your data
    so that each "document"
    contains all the relevant info. Yes, this leads to de-normalization. Yes,
    denormalized data makes a
    good DBA cringe. But that's the difference between searching and using a
    RDBMS.

    "Document" is somewhat misleading. A document in SOLR terms is just a
    collection of fields. And, BTW,
    there's no requirement that each document have the same fields (very unlike
    a DB).

    2.)    Can we make assumptions on query performance considering combined
    searches, range queries or structured data and wildcard searches? If we
    consider a data structure consisting of say 3 tables and each table contains
    a few million entries (e.g. first name, last name and address fields) and we
    search for common values (such as 'John', 'Smith' and 'New York') where

    a.       each value for itself and each combination would result in
    millions of hits
    Sure, but what those assumptions are is totally dependent on how you've set
    things up. SOLR has been successfully
    used on several billion document indexes. There are tools for making all
    that work (i.e. replication, sharding, etc)
    built into SOLR. So I suspect you can make things work. Several million
    documents is not that large a data set.

    As always, there are tradeoffs between speed and complexity. But from what
    you've described
    I see no show stoppers.

    b.      a person can have multiple first names and we want to make sure to
    receive any combination of the last name with any first name
    This just sounds like an OR. But the queries can be pretty complex queries.
    Some examples of what you expect would help.
    See multi-valued fields. So, a "document" can have multiple "firstname"
    entries. Again, not like a DB (your reflexes will trip you
    up on this point <G>).

    c.       we search for a last name and a range of birth dates
    Sure, range queries work just fine. Note that dates can trip you up, look at
    triedate if you experiment.

    3.)    Transaction safety: How does Lucene handle indexes? If we update
    data model and index, what happens to the index if anything goes wrong as
    soon as the data model has been persisted?
    A lot of work has been done to make SOLR quite robust if "anything goes
    wrong". That said, how are you backing up your data?
    That is, what is the source of the data you're going to index? If you're
    relying on your SOLR index to be your backup, you simply must back it up
    somewhere "often enough" to get by if your building burns down. I'd also
    think about storing your original input...

    This is no different than a DB. you have to guard against the disk crashing,
    someone walking by with a powerful magnet,  earthquake, flood, fires
    <G>.....

    Do note that if you modify your index schema, no existing documents reflect
    the new schema, you have to reindex them.

    I hope I made the issues clear to you, just some general thoughts about how
    Lucene would behave in a real world application scenario ... Any support or
    pointers to helpful documents or Web links are highly appreciated!
    Cheers for now,

    w


    --
    Lance Norskog
    goksron@gmail.com

    ---------------------------------------------------------------------
    To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
    For additional commands, e-mail: java-user-help@lucene.apache.org
  • Schreiner Wolfgang at Aug 31, 2010 at 9:18 am
    Hi,

    Thank you all for your time to answer my questions!
    However there are a few more issues which are not quite clear yet and hope to get advice on those too:

    1.) How is the index maintained? In another product where we use an indexer different from Lucene, we got one central index and a few JBoss servers all accessing the same index. So how does Lucene handle synchronization between multiple threads (different JVMs)? How does it maintain the index after update/delete operations on the database?
    2.) Is the index always up-to-date? In the FAQs it says we have to re-open the IndexReader periodically ... how expensive (in computational terms) is it to do that on every request for instance?
    3.) I'm still not sure about performance. According to the FAQs we need to build our own MultiPhraseQuery parser to support multiple terms and wildcards. For example consider 50.000.000 documents, 50.000 of them match term T1 in category A, 50.000 match term T2 in category B and 1.000.000 match term T3 in category C. 50 Match T1 in A and T2 in B and T3 in C. How fast is the algorithm in this case? Who guarantees that it doesn't start at the 1 million side?

    Cheers,

    w


    -----Original Message-----
    From: Lance Norskog
    Sent: Donnerstag, 26. August 2010 05:25
    To: java-user@lucene.apache.org
    Subject: Re: Lucene applicability

    A stepping stone to the above is that, in DB terms, a Lucene index is
    only one table. It has a suite of indexing features that are very
    different from database search. The features are oriented to searching
    large bodies of text for "ideas" rather than concrete words. It
    searches a lot faster than a DB. It also spends more time creating its
    various indexes than a DB. Other points- you can't add or drop fields
    or indexes.

    On Wed, Aug 25, 2010 at 10:33 AM, Erick Erickson
    wrote:
    The SOLR wiki has lots of good information, start there:
    http://wiki.apache.org/solr/

    Otherwise, see below...

    On Wed, Aug 25, 2010 at 6:20 AM, Schreiner Wolfgang <
    Wolfgang.Schreiner@itsv.at> wrote:
    Hi all,

    We are currently evaluating potential search frameworks (such as Hibernate
    Search) which might be suitable to use in our project (using Spring, JPA
    with Hibernate) ...
    I am sending this E-Mail in hope you can advise me on a few issues that
    would help us in our decision making process.


    1.)    Is Lucene suitable for full text database searches? I read Lucene
    was designed to index and search documents but how does it behave querying
    relational data sets in general?
    Let's start be talking about the phrase "full text database searches". One
    thing virtually all db-centric
    people trip over is trying to use SOLR as if it were a database. You just
    can't think about tables. The
    first time you think about using SOLR to do something join-like, stop and
    take a deep breath and
    think about documents instead. The general approach is to flatten your data
    so that each "document"
    contains all the relevant info. Yes, this leads to de-normalization. Yes,
    denormalized data makes a
    good DBA cringe. But that's the difference between searching and using a
    RDBMS.

    "Document" is somewhat misleading. A document in SOLR terms is just a
    collection of fields. And, BTW,
    there's no requirement that each document have the same fields (very unlike
    a DB).

    2.)    Can we make assumptions on query performance considering combined
    searches, range queries or structured data and wildcard searches? If we
    consider a data structure consisting of say 3 tables and each table contains
    a few million entries (e.g. first name, last name and address fields) and we
    search for common values (such as 'John', 'Smith' and 'New York') where

    a.       each value for itself and each combination would result in
    millions of hits
    Sure, but what those assumptions are is totally dependent on how you've set
    things up. SOLR has been successfully
    used on several billion document indexes. There are tools for making all
    that work (i.e. replication, sharding, etc)
    built into SOLR. So I suspect you can make things work. Several million
    documents is not that large a data set.

    As always, there are tradeoffs between speed and complexity. But from what
    you've described
    I see no show stoppers.

    b.      a person can have multiple first names and we want to make sure to
    receive any combination of the last name with any first name
    This just sounds like an OR. But the queries can be pretty complex queries.
    Some examples of what you expect would help.
    See multi-valued fields. So, a "document" can have multiple "firstname"
    entries. Again, not like a DB (your reflexes will trip you
    up on this point <G>).

    c.       we search for a last name and a range of birth dates
    Sure, range queries work just fine. Note that dates can trip you up, look at
    triedate if you experiment.

    3.)    Transaction safety: How does Lucene handle indexes? If we update
    data model and index, what happens to the index if anything goes wrong as
    soon as the data model has been persisted?
    A lot of work has been done to make SOLR quite robust if "anything goes
    wrong". That said, how are you backing up your data?
    That is, what is the source of the data you're going to index? If you're
    relying on your SOLR index to be your backup, you simply must back it up
    somewhere "often enough" to get by if your building burns down. I'd also
    think about storing your original input...

    This is no different than a DB. you have to guard against the disk crashing,
    someone walking by with a powerful magnet,  earthquake, flood, fires
    <G>.....

    Do note that if you modify your index schema, no existing documents reflect
    the new schema, you have to reindex them.

    I hope I made the issues clear to you, just some general thoughts about how
    Lucene would behave in a real world application scenario ... Any support or
    pointers to helpful documents or Web links are highly appreciated!
    Cheers for now,

    w


    --
    Lance Norskog
    goksron@gmail.com

    ---------------------------------------------------------------------
    To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
    For additional commands, e-mail: java-user-help@lucene.apache.org
  • Erick Erickson at Aug 31, 2010 at 12:05 pm
    See below
    On Tue, Aug 31, 2010 at 5:17 AM, Schreiner Wolfgang wrote:

    Hi,

    Thank you all for your time to answer my questions!
    However there are a few more issues which are not quite clear yet and hope
    to get advice on those too:

    1.) How is the index maintained? In another product where we use an indexer
    different from Lucene, we got one central index and a few JBoss servers all
    accessing the same index. So how does Lucene handle synchronization between
    multiple threads (different JVMs)? How does it maintain the index after
    update/delete operations on the database?
    It doesn't, you have to do it yourself. You'll have to write some app that
    periodically queries your db and when it detecte changes, updates the Lucene
    index. You should only have one JVM have an open writer at a time. You can
    have as many readers from as many jvms as you want. Note that your single
    JVM that has a writer can use multiple write threads. In other words writers
    are thread safe..

    2.) Is the index always up-to-date? In the FAQs it says we have to re-open
    the IndexReader periodically ... how expensive (in computational terms) is
    it to do that on every request for instance?
    the index is always up to date, how could it be otherwise <G>. When you open
    a reader, Lucene essentially takes a snapshot of the index at that moment.
    Any updated are not visible to that reader, thus the comment about reopening
    the reader.

    3.) I'm still not sure about performance. According to the FAQs we need to
    build our own MultiPhraseQuery parser to support multiple terms and
    wildcards. For example consider 50.000.000 documents, 50.000 of them match
    term T1 in category A, 50.000 match term T2 in category B and 1.000.000
    match term T3 in category C. 50 Match T1 in A and T2 in B and T3 in C. How
    fast is the algorithm in this case? Who guarantees that it doesn't start at
    the 1 million side?
    That can't be answered in the abstract because there are too many variables,
    you have to measure. But there are Lucene installations that have many more
    documents than that so you can probably get there. Consider SOLR if you need
    to replicate or shard your indexes because they get too big.

    I think you're probably thinking of Lucene in terms of an application, not
    an engine. Lucene is what you build your application around, it provides the
    guts of the searching. The rest is up to your.

    HTH
    Erick

    Cheers,
    w


    -----Original Message-----
    From: Lance Norskog
    Sent: Donnerstag, 26. August 2010 05:25
    To: java-user@lucene.apache.org
    Subject: Re: Lucene applicability

    A stepping stone to the above is that, in DB terms, a Lucene index is
    only one table. It has a suite of indexing features that are very
    different from database search. The features are oriented to searching
    large bodies of text for "ideas" rather than concrete words. It
    searches a lot faster than a DB. It also spends more time creating its
    various indexes than a DB. Other points- you can't add or drop fields
    or indexes.

    On Wed, Aug 25, 2010 at 10:33 AM, Erick Erickson
    wrote:
    The SOLR wiki has lots of good information, start there:
    http://wiki.apache.org/solr/

    Otherwise, see below...

    On Wed, Aug 25, 2010 at 6:20 AM, Schreiner Wolfgang <
    Wolfgang.Schreiner@itsv.at> wrote:
    Hi all,

    We are currently evaluating potential search frameworks (such as
    Hibernate
    Search) which might be suitable to use in our project (using Spring, JPA
    with Hibernate) ...
    I am sending this E-Mail in hope you can advise me on a few issues that
    would help us in our decision making process.


    1.) Is Lucene suitable for full text database searches? I read Lucene
    was designed to index and search documents but how does it behave
    querying
    relational data sets in general?
    Let's start be talking about the phrase "full text database searches". One
    thing virtually all db-centric
    people trip over is trying to use SOLR as if it were a database. You just
    can't think about tables. The
    first time you think about using SOLR to do something join-like, stop and
    take a deep breath and
    think about documents instead. The general approach is to flatten your data
    so that each "document"
    contains all the relevant info. Yes, this leads to de-normalization. Yes,
    denormalized data makes a
    good DBA cringe. But that's the difference between searching and using a
    RDBMS.

    "Document" is somewhat misleading. A document in SOLR terms is just a
    collection of fields. And, BTW,
    there's no requirement that each document have the same fields (very unlike
    a DB).

    2.) Can we make assumptions on query performance considering combined
    searches, range queries or structured data and wildcard searches? If we
    consider a data structure consisting of say 3 tables and each table
    contains
    a few million entries (e.g. first name, last name and address fields)
    and we
    search for common values (such as 'John', 'Smith' and 'New York') where

    a. each value for itself and each combination would result in
    millions of hits
    Sure, but what those assumptions are is totally dependent on how you've set
    things up. SOLR has been successfully
    used on several billion document indexes. There are tools for making all
    that work (i.e. replication, sharding, etc)
    built into SOLR. So I suspect you can make things work. Several million
    documents is not that large a data set.

    As always, there are tradeoffs between speed and complexity. But from what
    you've described
    I see no show stoppers.

    b. a person can have multiple first names and we want to make sure
    to
    receive any combination of the last name with any first name
    This just sounds like an OR. But the queries can be pretty complex queries.
    Some examples of what you expect would help.
    See multi-valued fields. So, a "document" can have multiple "firstname"
    entries. Again, not like a DB (your reflexes will trip you
    up on this point <G>).

    c. we search for a last name and a range of birth dates
    Sure, range queries work just fine. Note that dates can trip you up, look at
    triedate if you experiment.

    3.) Transaction safety: How does Lucene handle indexes? If we update
    data model and index, what happens to the index if anything goes wrong
    as
    soon as the data model has been persisted?
    A lot of work has been done to make SOLR quite robust if "anything goes
    wrong". That said, how are you backing up your data?
    That is, what is the source of the data you're going to index? If you're
    relying on your SOLR index to be your backup, you simply must back it up
    somewhere "often enough" to get by if your building burns down. I'd also
    think about storing your original input...

    This is no different than a DB. you have to guard against the disk crashing,
    someone walking by with a powerful magnet, earthquake, flood, fires
    <G>.....

    Do note that if you modify your index schema, no existing documents reflect
    the new schema, you have to reindex them.

    I hope I made the issues clear to you, just some general thoughts about
    how
    Lucene would behave in a real world application scenario ... Any support
    or
    pointers to helpful documents or Web links are highly appreciated!
    Cheers for now,

    w


    --
    Lance Norskog
    goksron@gmail.com

    ---------------------------------------------------------------------
    To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
    For additional commands, e-mail: java-user-help@lucene.apache.org

Related Discussions

Discussion Navigation
viewthread | post
Discussion Overview
groupjava-user @
categorieslucene
postedAug 25, '10 at 1:21p
activeAug 31, '10 at 12:05p
posts6
users4
websitelucene.apache.org

People

Translate

site design / logo © 2022 Grokbase