
Realtime Search for Social Networks Collaboration

J. Delgado
Sep 22, 2008 at 3:54 am
Please ignore the correction... "lose" is fine:-)
On Sun, Sep 21, 2008 at 8:38 PM, J. Delgado wrote:

Sorry, I meant "loose" (replacing "lose")

On Sun, Sep 21, 2008 at 8:38 PM, J. Delgado wrote:

On Sat, Sep 20, 2008 at 1:04 PM, Noble Paul നോബിള്‍ नोब्ळ् <noble.paul@gmail.com> wrote:
Moving back to the RDBMS model would be a big step backwards, where we
miss out on multivalued fields and arbitrary fields.

No one is suggesting that we "lose" any of the virtues of the field-based
indexing that Lucene provides. Quite the contrary: by extending the RDBMS
model with Lucene-based indexes, one can map relational rows to documents
and columns to fields. Note that one relational field can be mapped to one
or more text-based fields, and multi-valued fields will still be allowed.
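
To make the row-to-document mapping concrete, here is a minimal,
library-free sketch. It is not the Lucene OJVM code: the class name
RowToDocument, the "_text" suffix and the comma-splitting rule are all
assumptions made up for illustration. It only shows one row becoming one
document, one column feeding two fields, and a field holding several values.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Library-free illustration; none of these names come from Lucene, Oracle or H2.
final class RowToDocument {

    // Maps one relational row (column name -> value) to a "document"
    // (field name -> list of values). Keeping a list per field is what
    // leaves room for multi-valued fields.
    static Map<String, List<String>> toDocument(Map<String, String> row) {
        Map<String, List<String>> doc = new LinkedHashMap<>();
        for (Map.Entry<String, String> column : row.entrySet()) {
            String name = column.getKey();
            String value = column.getValue();

            // Assumption: a comma-separated column becomes a multi-valued field.
            List<String> values = new ArrayList<>();
            for (String part : value.split(",")) {
                values.add(part.trim());
            }
            doc.put(name, values);

            // One column can feed more than one field: here a second,
            // lower-cased copy intended for full-text matching.
            List<String> text = new ArrayList<>();
            text.add(value.toLowerCase(Locale.ROOT));
            doc.put(name + "_text", text);
        }
        return doc;
    }
}
```

In a real integration the document would presumably be a Lucene Document
and the second field would go through an Analyzer; the plain maps here
only stand in for that.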

Please check the Lucene OJVM implementation for details on the
implementation and the philosophy behind the RDBMS-Lucene converged model:

http://docs.google.com/Doc?id=ddgw7sjp_54fgj9kg

More discussion can be found on the blog of Marcelo, who will be
presenting at Oracle World 2008 this week.
http://marceloochoa.blogspot.com/

BTW, it just happens that this was implemented using Oracle, but a similar
implementation in H2 seems not only feasible but desirable.

-- Joaquin


On Tue, Sep 9, 2008 at 4:17 AM, Jason Rutherglen wrote:
Cool. I mention H2 because it does have some Lucene code in it, yes.
Also, according to some benchmarks it's the fastest of the open source
databases. I think it's possible to integrate realtime search into H2.
I suppose there is no need to store the data in Lucene in this case?
One loses the multiple values per field that Lucene offers, and the
schema becomes static. Perhaps it's a trade-off?

On Mon, Sep 8, 2008 at 6:17 PM, J. Delgado <joaquin.delgado@gmail.com> wrote:
Yes, both Marcelo and I would be interested.

We looked into H2, and it looks like something similar to Oracle's ODCI
can be implemented. Plus, the primitive full-text implementation is based
on Lucene.
I say primitive because, looking at the code, I saw that one cannot define
an Analyzer; that for each scan corresponding to a WHERE clause a searcher
is opened and closed, instead of being taken from a pool; and that there
is no way to queue changes to reduce the use of the IndexWriter, etc.
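
As a rough sketch of what queueing changes could look like (an
illustration only, not H2's or Lucene's API; the class BatchingIndex, its
batchSize parameter and the writerSessions counter are hypothetical),
changes are collected and applied in one writer session per batch:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for a full-text index; not the H2 or Lucene API.
final class BatchingIndex {
    private final List<String> pending = new ArrayList<>();   // queued, not yet searchable
    private final List<String> committed = new ArrayList<>(); // visible to searches
    private final int batchSize;
    private int writerSessions = 0; // how often the (expensive) writer was used

    BatchingIndex(int batchSize) {
        this.batchSize = batchSize;
    }

    // Queue a change; the writer is only used once a full batch has built up.
    void add(String doc) {
        pending.add(doc);
        if (pending.size() >= batchSize) {
            flush();
        }
    }

    // Apply every queued change in a single writer session.
    void flush() {
        if (pending.isEmpty()) {
            return;
        }
        writerSessions++;
        committed.addAll(pending);
        pending.clear();
    }

    int writerSessions() {
        return writerSessions;
    }

    int searchableCount() {
        return committed.size();
    }
}
```

The cost of this design is visibility: a queued document cannot be
searched until its batch is flushed, so the batch size trades writer
overhead against freshness.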

But it's open source, and that is a great starting point!

-- Joaquin

On Mon, Sep 8, 2008 at 2:05 PM, Jason Rutherglen wrote:
Perhaps an interesting project would be to integrate Ocean with H2
(www.h2database.com) to take advantage of both models. I'm not sure
exactly how that would work, but it seems like it would not be too
difficult. Perhaps this would make it possible to perform faster
hierarchical queries, and perhaps other types of queries that Lucene is
not capable of.

Is this something you are interested in collaborating on, Joaquin? I am
definitely interested in it.

On Sun, Sep 7, 2008 at 4:04 AM, J. Delgado <joaquin.delgado@gmail.com> wrote:
On Sat, Sep 6, 2008 at 1:36 AM, Otis Gospodnetic wrote:
Regarding real-time search and Solr, my feeling is the focus should be on
first adding real-time search to Lucene, and then we'll figure out how to
incorporate that into Solr later.

Otis, what do you mean exactly by "adding real-time search to Lucene"?
Note that Lucene, being an indexing/search library (and not a full-blown
search engine), is by definition "real-time": once you add/write a
document to the index it becomes immediately searchable, and once a
document is logically deleted it is no longer returned in a search, though
physical deletion happens during an index optimization.
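
The difference between logical and physical deletion can be sketched
without any library. The class TombstoneIndex and its method names are
hypothetical and only mimic the idea; this is not how Lucene stores
segments.

```java
import java.util.ArrayList;
import java.util.BitSet;
import java.util.List;

// Toy index illustrating logical vs. physical deletion; not Lucene code.
final class TombstoneIndex {
    private List<String> docs = new ArrayList<>();
    private BitSet deleted = new BitSet();

    // Adds a document and returns its id (its position in storage).
    int add(String doc) {
        docs.add(doc);
        return docs.size() - 1;
    }

    // Logical delete: only mark the document; storage is untouched.
    void delete(int docId) {
        deleted.set(docId);
    }

    // Searches skip marked documents, so a delete is visible immediately.
    List<String> search(String term) {
        List<String> hits = new ArrayList<>();
        for (int i = 0; i < docs.size(); i++) {
            if (!deleted.get(i) && docs.get(i).contains(term)) {
                hits.add(docs.get(i));
            }
        }
        return hits;
    }

    // Number of documents still physically stored, deleted or not.
    int storedCount() {
        return docs.size();
    }

    // Physical delete: rewrite storage without the marked documents.
    // Document ids change as a side effect.
    void optimize() {
        List<String> live = new ArrayList<>();
        for (int i = 0; i < docs.size(); i++) {
            if (!deleted.get(i)) {
                live.add(docs.get(i));
            }
        }
        docs = live;
        deleted = new BitSet();
    }
}
```

Search results are the same before and after optimize(); only the space
held by the deleted documents is reclaimed.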

Now, the problem of adding/deleting documents in bulk as part of a
transaction, and making these documents available for search immediately
after the transaction is committed, sounds more like a search engine
problem (i.e. SOLR, Nutch, Ocean), especially if these transactions are
known to be I/O expensive and thus are usually implemented as batched
processes with some kind of sync mechanism, which makes them
non-real-time.

For example, in my previous life, I designed and helped implement a
quasi-realtime enterprise search engine using Lucene, with a set of
multi-threaded indexers hitting a set of multiple indexes allocated across
different search services, which powered a broker-based distributed search
interface. The most recent documents provided to the indexers were always
added to the smaller in-memory (RAM) indexes, which usually could absorb
the load of a bulk "add" transaction; these would later be merged into
larger disk-based indexes and then flushed to make them ready to absorb
fresh docs. We even had further partitioning of the indexes that reflected
time periods, with caps on size, so that they could be merged into older,
more archival indexes which were used less (yes, the search engine's
default search was on data no more than 1 month old, though the user could
widen the time window by including archives).
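
A minimal sketch of the RAM-then-disk arrangement described above. The
names TieredIndex and ramCapacity are assumptions for illustration, and
the "disk" tier is just a second in-memory list standing in for a larger
on-disk index.

```java
import java.util.ArrayList;
import java.util.List;

// Toy two-tier index: a small buffer absorbs new documents and is
// periodically merged into a larger tier. Not the original system's code.
final class TieredIndex {
    private final List<String> ram = new ArrayList<>();  // small, fresh tier
    private final List<String> disk = new ArrayList<>(); // stand-in for the disk index
    private final int ramCapacity;
    private int merges = 0;

    TieredIndex(int ramCapacity) {
        this.ramCapacity = ramCapacity;
    }

    // New documents always land in the RAM tier first.
    void add(String doc) {
        ram.add(doc);
        if (ram.size() >= ramCapacity) {
            mergeToDisk();
        }
    }

    // Move the RAM tier into the disk tier and empty it for fresh docs.
    void mergeToDisk() {
        disk.addAll(ram);
        ram.clear();
        merges++;
    }

    // A search consults both tiers, freshest tier first, so a document
    // is searchable as soon as it has been added.
    List<String> search(String term) {
        List<String> hits = new ArrayList<>();
        for (String doc : ram) {
            if (doc.contains(term)) {
                hits.add(doc);
            }
        }
        for (String doc : disk) {
            if (doc.contains(term)) {
                hits.add(doc);
            }
        }
        return hits;
    }

    int ramSize() { return ram.size(); }
    int diskSize() { return disk.size(); }
    int merges() { return merges; }
}
```

The time-based partitions mentioned above would add further tiers behind
the disk tier; they are left out to keep the sketch short.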

As for SOLR and OCEAN, I would argue that these semi-structured search
engines are becoming more and more like relational databases with
full-text search capabilities (without the benefit of full relational
algebra -- for example, joins are not possible using SOLR). Notice that
"real-time" CRUD operations and transactionality are core DB concepts and
have been studied and developed by database communities for quite a long
time. There have been recent efforts on how to efficiently integrate
Lucene into relational databases (see the Lucene-Oracle JVM integration at
http://marceloochoa.blogspot.com/2007/09/running-lucene-inside-your-oracle-jvm.html
).

I think we should seriously look at joining efforts with open-source
database engine projects written in Java (see
http://java-source.net/open-source/database-engines) in order to blend IR
and ORM once and for all.

-- Joaquin

I've read Jason's Wiki as well. Actually, I had to read it a number of
times to understand bits and pieces of it. I have to admit there is still
some fuzziness about the whole thing in my head - is "Ocean" something
that already works, a separate project on googlecode.com? I think so. If
so, and if you are working on getting it integrated into Lucene, would it
make it less confusing to just refer to it as "real-time search", so there
is no confusion?

If this is to be initially integrated into Lucene, why are things like
replication, crowding/field collapsing, locallucene, name service, tag
index, etc. all mentioned there on the Wiki and bundled with the
description of how real-time search works and is to be implemented? I
suppose mentioning replication kind of makes sense because the replication
approach is closely tied to real-time search - all query nodes need to see
index changes fast. But Lucene itself offers no replication mechanism, so
maybe the replication is something to figure out separately, say on the
Solr level, later on "once we get there". I think even just the essential
real-time search requires substantial changes to Lucene (I remember seeing
large patches in JIRA), which makes it hard to digest, understand, comment
on, and ultimately commit (hence the lukewarm response, I think). Bringing
other non-essential elements into the discussion at the same time makes it
more difficult to process all this new stuff, at least for me. Am I the
only one who finds this hard?

That said, it sounds like we have some discussion going (Karl...), so I
look forward to understanding more! :)


Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch


----- Original Message ----
From: Yonik Seeley <yonik@apache.org>
To: java-dev@lucene.apache.org
Sent: Thursday, September 4, 2008 10:13:32 AM
Subject: Re: Realtime Search for Social Networks Collaboration

On Wed, Sep 3, 2008 at 6:50 PM, Jason Rutherglen wrote:
I also think it's got a lot of things now which makes integration
difficult to do properly.

I agree, and that's why the major bump in version number rather than
minor - we recognize that some features will need some amount of
rearchitecture.

I think the problem with integration with SOLR is it was designed with a
different problem set in mind than Ocean, originally the CNET shopping
application.

That was the first use of Solr, but it actually existed before that w/o
any defined use other than to be a "plan B" alternative to MySQL based
search servers (that's actually where some of the parameter names come
from... the default /select URL instead of /search, the "rows" parameter,
etc).

But you're right... some things like the replication strategy were
designed (well, borrowed from Doug to be exact) with the idea that it
would be OK to have slightly "stale" views of the data in the range of
minutes. It just made things easier/possible at the time. But tons of
Solr and Lucene users want almost instantaneous visibility of added
documents, if they can get it. It's hardly restricted to social network
applications.

Bottom line is that Solr aims to be a general enterprise search platform,
and getting as real-time as we can get, and as scalable as we can get are
some of the top priorities going forward.

-Yonik
---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org


--
--Noble Paul
