Hi,

We are currently running a Tomcat web application serving searches
over our Lucene index (10GB) on a single server machine (Dual 3GHz
CPU, 4GB RAM). Due to performance issues and to scale up to handle
more traffic/search requests, we are getting another server machine.

We are looking at two ways of scaling:
(1) duplicating the web application and index on the second machine
and load-balancing incoming users between the two servers.

(2) modifying our web application so that one machine will host our
web application (and associated MySQL database), while the other one
will host the Lucene index. The first machine would be dedicated to
our web application and database, while the second becomes our
dedicated Lucene search server. When users perform a search on the
website, the web application will send the request to the Lucene index
server, which will perform the search and return the results to the
web application.

We would like comments from users who have set up similar systems on
how you have accomplished (1) in your setups, and whether (2) is a
good choice for scaling.


Attached is a more complete RTF document outlining our architecture
and proposal. We appreciate your perusal and comments.

Regards,
CW

  • Mathieu Lecarme at Jun 28, 2007 at 2:15 pm
    Server one handles the website.
    Server two runs a lightweight Tomcat that handles Lucene searches.

    In front, a lighttpd instance sends /search to server two and
    everything else to server one.

    With this scheme you can add more Lucene servers and let lighttpd
    round-robin between them.

    Be careful with fault tolerance and index replication.

    M.
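
Editor's sketch of the routing rule described in this reply. It is written in Python rather than as lighttpd configuration, and it only illustrates the decision logic: the backend addresses and the `route` function are hypothetical names, not taken from the thread.

```python
from itertools import cycle

# Hypothetical backend addresses, for illustration only.
WEB_BACKEND = "server-one:8080"
SEARCH_BACKENDS = ["server-two:8080", "server-three:8080"]

# Endless iterator that hands out the search backends in turn (round robin).
_search_pool = cycle(SEARCH_BACKENDS)


def route(path: str) -> str:
    """Return the backend that should handle a request for `path`."""
    # Ignore the query string when deciding where the request goes.
    bare_path = path.split("?", 1)[0]
    if bare_path == "/search" or bare_path.startswith("/search/"):
        return next(_search_pool)
    return WEB_BACKEND
```

Adding another search server then only means appending to `SEARCH_BACKENDS`. The sketch does no health checking, which is where the fault-tolerance warning above comes in.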

  • Samuel LEMOINE at Jun 28, 2007 at 2:20 pm

    I'm keenly interested in this issue too, as I'm working on a distributed
    architecture for Lucene. I'm only at the very beginning of my study, so
    I can't help you much, but Hadoop might fit your requirements. It's a
    sub-project of Lucene that aims to parallelise Lucene.
    See http://lucene.apache.org/hadoop/about.html, though I don't know whether
    it scales well to very small clusters...

    As for your attachment, I couldn't access it; all I received was a
    "Partie 1.2" file containing this text:

    ---------------------------------------------------------------------
    To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
    For additional commands, e-mail: java-user-help@lucene.apache.org


    Cordially,

    Samuel

  • Mathieu Lecarme at Jun 28, 2007 at 2:41 pm

    Reading from an index replicated across several servers is not hard; the
    writing (and locking) part is harder.
    The approach chosen by the Technorati team is one machine that indexes, with
    rsync replication to the search cluster and a cp-then-mv commit.
    If you need more indexing power, use Nutch.

    M.
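
Editor's sketch of a copy-then-rename commit in the spirit of the "cp and mv" step mentioned above. It is an assumption about how such a commit might look, not Technorati's actual scripts: `shutil.copytree` stands in for the rsync transfer, and the `snapshot-NNNNNN` directory naming and both function names are hypothetical.

```python
import shutil
from pathlib import Path
from typing import Optional


def publish_snapshot(source: Path, search_root: Path, version: int) -> Path:
    """Copy a finished index into search_root, then commit it with a rename."""
    staging = search_root / f"snapshot-{version:06d}.tmp"
    final = search_root / f"snapshot-{version:06d}"
    # Slow "cp" step: searchers ignore directories that still end in .tmp.
    shutil.copytree(source, staging)
    # "mv" commit: a rename within one filesystem, so the snapshot appears whole.
    staging.rename(final)
    return final


def latest_snapshot(search_root: Path) -> Optional[Path]:
    """Return the newest committed snapshot, or None if there is none yet."""
    committed = [
        p for p in search_root.iterdir()
        if p.is_dir() and p.name.startswith("snapshot-") and not p.name.endswith(".tmp")
    ]
    # Zero-padded version numbers make the names sort in version order.
    return max(committed, key=lambda p: p.name, default=None)
```

A search node would reopen its searcher on `latest_snapshot(...)` after each publish. Because only one machine writes the index, the locking problem mentioned above does not arise on the search nodes.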

  • Chris Lu at Jun 28, 2007 at 5:39 pm
    Basically, for a scalable solution you need to separate your web app
    from your searching. Searching is a different concern, and keeping it
    separate lets you develop more kinds of search as new requirements come in.

    Technorati's way is very similar to one of the DBSight configurations. One
    machine is dedicated to indexing, and one or more other machines are
    dedicated to searching. The searching nodes subscribe to the indexing node.
    Transferring the index is pretty quick. This way scales well.

    (Database)=crawl=>(Indexing node)=replicating index=>(Searching nodes)==>end user query

    However, if your index is huge, you may need to change your index structure
    and split the indexing node into several, so that each indexing node serves
    only one specific kind of index. This is a way of slicing the index
    vertically in order to scale it.

    --
    Chris Lu
    -------------------------
    Instant Scalable Full-Text Search On Any Database/Application
    site: http://www.dbsight.net
    demo: http://search.dbsight.com
    Lucene Database Search in 3 minutes:
    http://wiki.dbsight.com/index.php?title=Create_Lucene_Database_Search_in_3_minutes

  • Grant Ingersoll at Jun 28, 2007 at 5:14 pm
    Hadoop is not designed for this type of scenario.

    Have a look at Solr (http://lucene.apache.org/solr); this is pretty
    much one of its main use cases. I think it will do what you need
    and will more than likely work with a minimum of configuration on
    your existing index (but don't hold me to that statement).

    Also, have you profiled your application, so that you are
    sure moving Lucene off the machine is going to help that much?

    Cheers,
    Grant

    P.S. The mailing list strips attachments.
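
Editor's sketch of the kind of coarse measurement the profiling question points at: recording how much wall-clock time each stage of a request takes before deciding what to move to another machine. The stage names and the `timed` helper are hypothetical; a real Tomcat application would use a Java profiler or equivalent timers in its own code.

```python
import time
from collections import defaultdict
from contextlib import contextmanager

# Accumulated wall-clock seconds per named stage of request handling.
stage_seconds = defaultdict(float)


@contextmanager
def timed(stage: str):
    """Add the time spent inside the `with` block to stage_seconds[stage]."""
    start = time.perf_counter()
    try:
        yield
    finally:
        # Runs even if the stage raises, so failed requests are counted too.
        stage_seconds[stage] += time.perf_counter() - start
```

Wrapping the search call in `with timed("search"):` and the database and page-rendering code in their own blocks shows, after some traffic, whether searching really dominates the request time.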
    --------------------------
    Grant Ingersoll
    Center for Natural Language Processing
    http://www.cnlp.org/tech/lucene.asp

    Read the Lucene Java FAQ at http://wiki.apache.org/lucene-java/LuceneFAQ



  • Chun Wei Ho at Jul 8, 2007 at 3:08 am
    Thanks for your comments and suggestions, everyone :)

    It looks like the general trend is in favour of (2): splitting
    the frontend web application from the searching application.

    Solr looks a lot like what we would have liked, but unfortunately we
    finished our application a while before Solr first became
    available. I can honestly say that, reading up on Solr's architecture,
    the way it segregates the searching layer from the frontend (so you can
    add further dedicated search machines) is what we had in mind for (2).

    Splitting our index looks like the way to go, and I am curious how
    people are splitting their indexes. By category, or by
    time (say, dividing two months' worth of index additions into eight
    indexes of a week each)? And how would the system cope with the
    multiple indexes?
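
Editor's sketch of one way a time-based split could work: each document goes to a weekly shard, and a query is sent to every shard and the hits merged. This is an illustration, not an answer given in the thread. The shard naming, the `Hit` shape, and both function names are hypothetical, and the merge assumes that scores from different shards are comparable, which is not guaranteed when each index computes its own term statistics.

```python
import heapq
from datetime import date
from typing import Callable, Dict, Iterable, List, Tuple

Hit = Tuple[float, str]  # (score, document id)


def shard_for(day: date) -> str:
    """Name the weekly shard a document dated `day` belongs to, e.g. '2007-W26'."""
    year, week, _ = day.isocalendar()
    return f"{year}-W{week:02d}"


def search_all(shards: Dict[str, Callable[[str], Iterable[Hit]]],
               query: str, limit: int) -> List[Hit]:
    """Run `query` against every shard and keep the `limit` best hits overall."""
    # Each shard is represented by a function that returns (score, id) pairs.
    all_hits = (hit for search in shards.values() for hit in search(query))
    # nlargest returns the hits sorted from highest to lowest score.
    return heapq.nlargest(limit, all_hits)
```

A split by category would only change `shard_for`; the fan-out and merge stay the same. Queries restricted to a date range can also skip shards outside that range.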


