Hi,
I've been using Lucene for a project and it works great on a single dev
machine. The next step is to investigate the best method of deploying Lucene
so that multiple web servers can access the Lucene index directory.

I see four potential options:

1) Each web server indexes the content separately. This will potentially
cause different web servers to have slightly different indexes at any given
time, and it also duplicates the workload of indexing the content.
2) Using rsync (or a similar tool) to regularly update a local Lucene index
directory on each web server. I imagine there will be locking issues that
need to be resolved here.
3) Using a network file system that all the web servers can access. I don't
have much experience in this area, but I suspect search latency could be
high.
4) Some alternative Lucene-specific solution that I haven't found in the
wiki / Lucene documentation.

The indexes aim to be as real-time as possible; I currently update my
IndexReaders in a background thread every 20 seconds.
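
For reference, a rough sketch of that background-reopen pattern with the
Lucene 2.9-era Java API (the path, interval, and error handling here are
only illustrative):

    import java.io.File;
    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.store.FSDirectory;

    // Background refresher: reopen the reader periodically so searches see
    // recent commits. Closing the old reader immediately is a simplification;
    // real code must wait for in-flight searches to finish first.
    public class ReaderRefresher implements Runnable {
        private volatile IndexReader reader;

        public ReaderRefresher(File indexDir) throws Exception {
            // open read-only so the searcher never takes the write lock
            this.reader = IndexReader.open(FSDirectory.open(indexDir), true);
        }

        public IndexReader getReader() {
            return reader;
        }

        public void run() {
            while (!Thread.currentThread().isInterrupted()) {
                try {
                    IndexReader newReader = reader.reopen(); // cheap no-op if nothing changed
                    if (newReader != reader) {
                        IndexReader old = reader;
                        reader = newReader;
                        old.close();
                    }
                    Thread.sleep(20 * 1000L); // the 20-second interval mentioned above
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt(); // stop on interrupt
                    break;
                } catch (Exception e) {
                    // log and keep going in a real deployment
                }
            }
        }
    }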

Does anyone have any recommendations on the best approach?

Cheers,
Chris


  • Jason Rutherglen at Oct 6, 2009 at 11:04 pm
    Chris,

    It sounds like you're on the right track. Have you looked at
    Solr, which uses the rsync/Java replication method you mentioned?
    Replication and near-realtime search together aren't quite there yet in
    Solr; however, it wouldn't be too hard to add.

    -J
  • No spam at Oct 7, 2009 at 1:50 am
    Have you investigated using Terracotta / Compass? We need real-time updates
    across the index using multiple web servers. I recently got this up and
    running and we're going to be doing some performance testing. It's very
    easy: essentially you just replace your FSDirectoryProvider with a
    TerracottaDirectoryProvider that wraps the Compass TerracottaDirectory.

    http://relation.to/Bloggers/HibernateSearchClusteringWithTerracotta
  • Jake Mannix at Oct 7, 2009 at 2:08 am
    Hi Chris,

    Answering your question depends in part on whether the kind of scalability
    you need is sharding (your index is expected to grow very large) or just
    replication (your query load is large and you need failover). It sounds
    like you're mostly thinking about the latter.

    1) Each web server indexes the content separately. This will potentially
    cause different web servers to have slightly different indexes at any given
    time, and it also duplicates the workload of indexing the content.
    If your indexing throughput is small enough, this can be a perfectly simple
    way to do it. Just make sure you're not DoS'ing your DB if you're indexing
    via direct DB queries (i.e., have a message queue or something else firing
    off indexing events instead of all web servers firing off simultaneous,
    identical DB queries from different places; DB caching will deal with this
    pretty well, but you still need to be careful).
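
    To make that concrete, here's a rough sketch: the queue wiring and class
    names are hypothetical, and only the IndexWriter calls are the actual
    Lucene 2.9-era API. One worker per box drains document events from a
    queue, so the web servers never issue duplicate indexing queries
    themselves.

        import java.util.concurrent.BlockingQueue;
        import org.apache.lucene.document.Document;
        import org.apache.lucene.index.IndexWriter;

        // One worker per box drains indexing events from a queue, so each web
        // server indexes from the event stream instead of re-querying the DB.
        class IndexingWorker implements Runnable {
            private final BlockingQueue<Document> events; // fed by your queue consumer
            private final IndexWriter writer;

            IndexingWorker(BlockingQueue<Document> events, IndexWriter writer) {
                this.events = events;
                this.writer = writer;
            }

            public void run() {
                try {
                    while (true) {
                        writer.addDocument(events.take());
                        if (events.isEmpty()) {
                            writer.commit(); // make the batch visible to reopened readers
                        }
                    }
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                } catch (Exception e) {
                    // log and decide whether to retry in a real deployment
                }
            }
        }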

    2) Using rsync (or a similar tool) to regularly update a local lucene index
    directory on each web server. I imagine there will be locking issues that
    need to be resolved here.
    rsync can work great, and as Jason said, it is how Solr replication works
    and it scales well. Locking isn't really a worry, because in this setup the
    slaves are read-only, so they won't compete with rsync for write access.

    3) Using a network file system that all the web servers can access. I don't
    have much experience in this area, but I suspect search latency could be
    high.
    Generally this is a really bad idea, and can lead to really hard-to-debug
    performance problems.

    4) Some alternative Lucene-specific solution that I haven't found in the
    wiki / Lucene documentation.
    The indexes aim to be as real-time as possible; I currently update my
    IndexReaders in a background thread every 20 seconds.
    This is where things diverge from common practice, especially if at some
    point you decide to lower that to 10 or 5 seconds.

    In this case, I'd say that if you have a reliable, scalable queueing system
    for getting indexing events distributed to all of your servers, then
    indexing on all replicas simultaneously can be the best way to have
    maximally realtime search: either using the very new "near realtime search"
    feature in Lucene 2.9, using something home-brewed which indexes in memory
    and on disk simultaneously, or using Zoie ( http://zoie.googlecode.com ),
    an open-source realtime search library built on top of Lucene which we at
    LinkedIn built and have been using in production for the past year, serving
    tens of millions of queries a day in real time (meaning milliseconds, even
    under fairly high indexing load).
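
    For the Lucene 2.9 near-realtime route, the core of it is just pulling
    readers from the IndexWriter. A minimal sketch, where the directory,
    analyzer, and field choices are only illustrative:

        import org.apache.lucene.analysis.standard.StandardAnalyzer;
        import org.apache.lucene.document.Document;
        import org.apache.lucene.document.Field;
        import org.apache.lucene.index.IndexReader;
        import org.apache.lucene.index.IndexWriter;
        import org.apache.lucene.search.IndexSearcher;
        import org.apache.lucene.store.RAMDirectory;
        import org.apache.lucene.util.Version;

        public class NrtSketch {
            public static void main(String[] args) throws Exception {
                RAMDirectory dir = new RAMDirectory();
                IndexWriter writer = new IndexWriter(dir,
                        new StandardAnalyzer(Version.LUCENE_29),
                        IndexWriter.MaxFieldLength.UNLIMITED);

                Document doc = new Document();
                doc.add(new Field("body", "hello realtime",
                        Field.Store.NO, Field.Index.ANALYZED));
                writer.addDocument(doc);

                // 2.9's near-realtime API: a reader that sees the uncommitted doc
                IndexReader nrtReader = writer.getReader();
                IndexSearcher searcher = new IndexSearcher(nrtReader);
                // ... run queries; periodically call writer.getReader() again ...

                searcher.close();
                nrtReader.close();
                writer.close();
            }
        }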

    -jake mannix
  • Chris Were at Oct 7, 2009 at 9:51 am
    Thanks for all the excellent replies.
    Lots of great software mentioned that I'd never heard of -- and I thought
    I'd Googled this subject to death already!

    Cheers,
    Chris.
  • Chris Were at Oct 9, 2009 at 4:33 am

    In this case, I'd say that if you have a reliable, scalable queueing system
    for getting indexing events distributed to all of your servers, then
    indexing on all replicas simultaneously can be the best way to have
    maximally realtime search: either using the very new "near realtime search"
    feature in Lucene 2.9, using something home-brewed which indexes in memory
    and on disk simultaneously, or using Zoie ( http://zoie.googlecode.com ),
    an open-source realtime search library built on top of Lucene which we at
    LinkedIn built and have been using in production for the past year, serving
    tens of millions of queries a day in real time (meaning milliseconds, even
    under fairly high indexing load).

    Zoie looks very close to what I'm after; however, my whole app is written
    in Python and uses PyLucene, so there is a non-trivial amount of work to
    make things work with Zoie.

    I'm currently experiencing a bottleneck when optimising my index. How is
    this handled / solved with Zoie?

    Cheers,
    Chris
  • Jake Mannix at Oct 9, 2009 at 5:11 am

    On Thu, Oct 8, 2009 at 9:32 PM, Chris Were wrote:

    Zoie looks very close to what I'm after, however my whole app is written in
    Python and uses PyLucene, so there is a non-trivial amount of work to make
    things work with Zoie.
    I've never used PyLucene before, but since it's a wrapper, plugging in Zoie
    should be doable: instead of accessing the raw Lucene API via PyLucene,
    you could instantiate a proj.zoie.impl.indexing.ZoieSystem, which would
    take care of the IndexReader management for you; you'd just ask it for
    IndexReaders and pass it indexing events.

    I'm not sure how well Python interacts with the JVM. Whenever I use other
    languages, I try to turn that paradigm upside down: use Jython, JRuby, or
    Scala, and live *inside* the JVM. Then everything is nice and happy. :)

    I'm currently experiencing a bottleneck when optimising my index. How is
    this handled / solved with Zoie?
    In a realtime environment, optimizing your index isn't always the best
    thing to do. Optimize down to a small number of segments (with the
    under-used IndexWriter.optimize(int maxNumSegments) call) if you need to,
    but optimizing down to 1 segment means you completely blow away your OS
    disk cache, as all of your old files are gone and you have one huge new
    ginormous one. Keeping a balanced set of segments, in which only a couple
    of the big ones merge together occasionally (and the small ones are
    continuously being integrated into bigger ones), means you only blow away
    parts of the I/O cache at a time.

    But to answer your question more directly: with Zoie, the disk index is
    only refreshed every 15 minutes or so (this is configurable), because you
    have a RAMDirectory serving realtime results as well. Any background
    segment merging and optimizing can be done on the disk index during this
    15-minute interval, and typical use cases keep the target number of
    segments at a fixed, moderately small number (5 or 10 segments or so).

    The optimize() call takes progressively more time the fewer segments you
    have left, and much of the time is in the final 2-to-1 segment merge, so if
    you never do that last step, things go a lot faster. You can try this
    without Zoie: replace your optimize() calls with optimize(2), optimize(5),
    or optimize(10), and see
    a) how long this takes instead, and
    b) what effect it has on your latency (it will cost you something, but how
    much: check and see!).
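
    If you want to time it, something like the following works; the index path
    and the choice of 5 are just placeholders, and IndexWriter.optimize(int)
    is the real 2.9-era call in question:

        import java.io.File;
        import org.apache.lucene.analysis.standard.StandardAnalyzer;
        import org.apache.lucene.index.IndexWriter;
        import org.apache.lucene.store.FSDirectory;
        import org.apache.lucene.util.Version;

        public class PartialOptimize {
            public static void main(String[] args) throws Exception {
                IndexWriter writer = new IndexWriter(
                        FSDirectory.open(new File("/path/to/index")), // placeholder path
                        new StandardAnalyzer(Version.LUCENE_29),
                        IndexWriter.MaxFieldLength.UNLIMITED);

                long start = System.currentTimeMillis();
                writer.optimize(5); // merge down to at most 5 segments instead of 1
                System.out.println("optimize(5) took "
                        + (System.currentTimeMillis() - start) + " ms");
                writer.close();
            }
        }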

    If you end up trying Zoie with PyLucene somehow, please post about it;
    we'd love to hear about it being used in different sorts of environments.

    -jake
  • Chris Were at Oct 9, 2009 at 6:13 am
    Hi Jake,
    Thanks for the great insight and suggestions.

    I will investigate different optimize() levels and see if that helps my
    particular use case -- if not, I'll consider the Zoie route and let you
    know how I get on.

    Cheers,
    Chris
