The SOLR wiki has lots of good information; start there: http://wiki.apache.org/solr/
Otherwise, see below...
On Wed, Aug 25, 2010 at 6:20 AM, Schreiner Wolfgang wrote:
We are currently evaluating potential search frameworks (such as Hibernate
Search) which might be suitable to use in our project (using Spring, JPA
with Hibernate) ...
I am sending this E-Mail in hope you can advise me on a few issues that
would help us in our decision making process.
1.) Is Lucene suitable for full text database searches? I read Lucene
was designed to index and search documents but how does it behave querying
relational data sets in general?
Let's start by talking about the phrase "full text database searches". One
thing virtually all db-centric people trip over is trying to use SOLR as if
it were a database. You just can't think about tables. The first time you
think about using SOLR to do something join-like, stop, take a deep breath,
and think about documents instead. The general approach is to flatten your
data so that each "document" contains all the relevant info. Yes, this leads
to de-normalization. Yes, denormalized data makes a good DBA cringe. But
that's the difference between searching and using a
"Document" is somewhat misleading. A document in SOLR terms is just a
collection of fields. And, BTW, there's no requirement that each document
have the same fields (very unlike
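To make the flattening idea concrete, here is a rough sketch in plain Python (nothing SOLR-specific). The three "tables" and all field names are made up for illustration; the point is only that each person becomes one self-contained document, and that documents need not share the same fields.

```python
# Hypothetical rows from three relational tables (names invented for illustration).
persons = [{"id": 1, "lastname": "Smith"}, {"id": 2, "lastname": "Jones"}]
first_names = [
    {"person_id": 1, "firstname": "John"},
    {"person_id": 1, "firstname": "Jack"},
    {"person_id": 2, "firstname": "Mary"},
]
addresses = [{"person_id": 1, "city": "New York"}]


def flatten(persons, first_names, addresses):
    """Build one flat document per person instead of joining at query time."""
    docs = {p["id"]: {"id": p["id"], "lastname": p["lastname"]} for p in persons}
    for row in first_names:
        # A field may hold several values, so collect them in a list.
        docs[row["person_id"]].setdefault("firstname", []).append(row["firstname"])
    for row in addresses:
        docs[row["person_id"]].setdefault("city", []).append(row["city"])
    return list(docs.values())


docs = flatten(persons, first_names, addresses)
for doc in docs:
    print(doc)
```

Note that the second document simply has no "city" field at all, which is fine for a document but would be a NULL column in a table.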
2.) Can we make assumptions on query performance considering combined
searches, range queries or structured data and wildcard searches? If we
consider a data structure consisting of say 3 tables and each table contains
a few million entries (e.g. first name, last name and address fields) and we
search for common values (such as 'John', 'Smith' and 'New York') where
a. each value for itself and each combination would result in
millions of hits
Sure, but what those assumptions are depends entirely on how you've set
things up. SOLR has been used successfully on indexes with several billion
documents. There are tools for making all that work (replication, sharding,
etc.) built into SOLR. So I suspect you can make things work. Several
million documents is not that large a data set. As always, there are
tradeoffs between speed and complexity. But from what I see, there are no
show stoppers.
b. a person can have multiple first names and we want to make sure to
receive any combination of the last name with any first name
This just sounds like an OR, though queries can get pretty complex. Some
examples of what you expect would help.
See multi-valued fields: a "document" can have multiple "firstname"
entries. Again, not like a DB (your reflexes will trip you
up on this point <G>).
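As a sketch of what such an OR might look like, the Python helper below builds a query string in the standard Lucene query syntax. The field names "lastname" and "firstname" are assumptions (use whatever your schema defines), and the helper functions are hypothetical, not part of any SOLR client.

```python
def quote(term):
    """Wrap a term in double quotes, escaping backslashes and embedded quotes."""
    return '"' + term.replace("\\", "\\\\").replace('"', '\\"') + '"'


def name_query(last_name, first_names):
    """Match the last name together with any one of the given first names."""
    any_first = " OR ".join(quote(name) for name in first_names)
    return f"lastname:{quote(last_name)} AND firstname:({any_first})"


query = name_query("Smith", ["John", "Jack"])
print(query)
```

If "firstname" is multi-valued, a document matches when any one of its first names matches, so there is no need for one document per name combination.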
c. we search for a last name and a range of birth dates
Sure, range queries work just fine. Note that dates can trip you up; look at
the trie date field type (TrieDateField) if you experiment.
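One way dates trip people up is the format: SOLR date fields expect UTC timestamps like 1970-01-01T00:00:00Z. Here is a small Python sketch that builds a last-name-plus-birth-date range query. The field names "lastname" and "birthdate" and both helpers are assumptions for illustration only.

```python
from datetime import datetime, timezone


def solr_date(dt):
    """Format a timezone-aware datetime as a UTC timestamp ending in 'Z'."""
    if dt.tzinfo is None:
        raise ValueError("expected a timezone-aware datetime")
    return dt.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")


def birth_range_query(last_name, start, end):
    """Last name plus an inclusive birth date range (square brackets are inclusive)."""
    date_range = f"[{solr_date(start)} TO {solr_date(end)}]"
    return f'lastname:"{last_name}" AND birthdate:{date_range}'


start = datetime(1970, 1, 1, tzinfo=timezone.utc)
end = datetime(1979, 12, 31, 23, 59, 59, tzinfo=timezone.utc)
query = birth_range_query("Smith", start, end)
print(query)
```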
3.) Transaction safety: How does Lucene handle indexes? If we update
data model and index, what happens to the index if anything goes wrong as
soon as the data model has been persisted?
A lot of work has been done to make SOLR quite robust if "anything goes
wrong". That said, how are you backing up your data?
That is, what is the source of the data you're going to index? If you're
relying on your SOLR index to be your backup, you simply must back it up
somewhere "often enough" to get by if your building burns down. I'd also
think about storing your original input...
This is no different than a DB. You have to guard against the disk crashing,
someone walking by with a powerful magnet, earthquake, flood, fire...
Do note that if you modify your index schema, no existing documents reflect
the new schema; you have to reindex them.
I hope I made the issues clear to you, just some general thoughts about how
Lucene would behave in a real world application scenario ... Any support or
pointers to helpful documents or Web links are highly appreciated!
Cheers for now,