FAQ
distributing the indexing process
If we have to index a lot of documents, is there a way to divide the
documents into multiple sets and index them on multiple machines in
parallel, and then merge the resulting indexes back into a single
machine? If yes, will the result be logically equivalent to indexing all
the documents on a single machine?



Thanks,

-gc


  • Danil ŢORIN at Jun 30, 2011 at 9:35 am
    It depends....

    If all documents are distinct then, yeah, go for it.

    If you have multiple versions of the same document in your data and you
    only want to index the latest version... then you need a clever way to
    split the data so that all versions of a document are indexed on the
    same host and you don't end up with duplicates later.

    But my biggest concern is: if your index is so big that you need to
    build it on different hosts, are you sure you want it combined into
    a single index?
    Maybe it's a good idea to partition it?
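
One way to meet the same-host requirement described above is to route each document by a stable hash of its key. The following is a minimal sketch, not something proposed in the thread: the class `ShardRouter`, the method `hostFor`, and the idea that every document carries a string key shared by all of its versions are assumptions made for illustration.

```java
import java.nio.charset.StandardCharsets;
import java.util.zip.CRC32;

// Hypothetical helper: decides which indexing host receives a document.
final class ShardRouter {
    private ShardRouter() {}

    /**
     * Maps a document key to a host number in the range [0, hostCount).
     * The same key always yields the same host for a given hostCount,
     * so every version of one document ends up on one machine.
     */
    static int hostFor(String documentKey, int hostCount) {
        if (hostCount <= 0) {
            throw new IllegalArgumentException("hostCount must be positive");
        }
        CRC32 crc = new CRC32();
        crc.update(documentKey.getBytes(StandardCharsets.UTF_8));
        // getValue() is an unsigned 32-bit value held in a long, so the
        // remainder is never negative and always fits in an int.
        return (int) (crc.getValue() % hostCount);
    }
}
```

Because the mapping depends only on the key and the host count, every version of a given document lands on the same host, which can then keep only the latest version before indexing. Changing the host count changes the mapping, so it has to stay fixed for the whole indexing run.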

    ---------------------------------------------------------------------
    To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
    For additional commands, e-mail: java-user-help@lucene.apache.org
  • Guru Chandar at Jun 30, 2011 at 9:44 am
    Thanks for the response. The documents are all distinct. My (limited) understanding is that partitioning the index will lead to results that differ from the case where everything is in one partition, because Lucene currently does not support distributed idf. Is this correct? Is there a way to make it work seamlessly?

    Regards,
    -gc


  • Sanne Grinovero at Jun 30, 2011 at 1:23 pm
    Hello,
    You could have each node build a separate index, and then merge the
    results back into a single consistent index using

    org.apache.lucene.index.IndexWriter.addIndexes(Directory...)

    Regards,
    Sanne
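
On the equivalence question, a toy model may help. The sketch below is plain Java and does not use Lucene; `ToyIndexMerge`, `build`, and `merge` are hypothetical names. It builds a term-to-documents map per node, merges the maps, and lets you compare the result with a map built from all documents at once. It assumes, as in this thread, that the document sets on the nodes are distinct. In Lucene itself the merge step would be the `addIndexes(Directory...)` call mentioned above.

```java
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.SortedSet;
import java.util.TreeMap;
import java.util.TreeSet;

// Toy inverted index: term -> sorted set of document keys.
final class ToyIndexMerge {
    private ToyIndexMerge() {}

    /** Builds an index from a map of document key -> document text. */
    static Map<String, SortedSet<String>> build(Map<String, String> docs) {
        Map<String, SortedSet<String>> index = new TreeMap<>();
        for (Map.Entry<String, String> doc : docs.entrySet()) {
            String[] terms = doc.getValue().toLowerCase(Locale.ROOT).split("\\s+");
            for (String term : terms) {
                if (term.isEmpty()) {
                    continue; // leading whitespace produces an empty token
                }
                index.computeIfAbsent(term, t -> new TreeSet<>()).add(doc.getKey());
            }
        }
        return index;
    }

    /** Merges per-node indexes by taking the union of each term's postings. */
    static Map<String, SortedSet<String>> merge(List<Map<String, SortedSet<String>>> parts) {
        Map<String, SortedSet<String>> merged = new TreeMap<>();
        for (Map<String, SortedSet<String>> part : parts) {
            for (Map.Entry<String, SortedSet<String>> entry : part.entrySet()) {
                merged.computeIfAbsent(entry.getKey(), t -> new TreeSet<>())
                      .addAll(entry.getValue());
            }
        }
        return merged;
    }
}
```

In this model, merging the per-node maps gives the same term-to-documents mapping as building one map from all documents, which is the sense of "logically equivalent" the question asks about. Details such as internal document numbering are outside this toy model.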

  • Toke Eskildsen at Jun 30, 2011 at 1:37 pm

    On Thu, 2011-06-30 at 11:45 +0200, Guru Chandar wrote:
    > Thanks for the response. The documents are all distinct. My (limited)
    > understanding on partitioning the indexes will lead to results being
    > different from the case where you have all in one partition, due to
    > Lucene currently not supporting distributed idf. Is this correct?

    Yes, that is a prime blocker for proper distributed search with Lucene.

    > Is there a way to make it work seamlessly?

    There's some work being done with Solr, but it is not stable:
    https://issues.apache.org/jira/browse/SOLR-1632

    We're experimenting with distributed idf by assigning different weights
    to the queries sent to different searchers, based on term statistics
    from the different indexes. However, it is quite a hack and one we've
    done because one of the indexes is external and out of our control.
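
To make the idf problem concrete, here is a small sketch in plain Java. It assumes the classic TF-IDF formula idf = 1 + ln(numDocs / (docFreq + 1)), the form used by Lucene's DefaultSimilarity of that era. The class `GlobalIdf` and its methods are hypothetical, and this is neither the SOLR-1632 patch nor the query-weighting hack described above; it only shows why per-index statistics disagree and how summed statistics remove the disagreement.

```java
// Hypothetical helper contrasting per-shard idf with idf from summed statistics.
final class GlobalIdf {
    private GlobalIdf() {}

    /** Classic idf: 1 + ln(numDocs / (docFreq + 1)). Assumed formula, see lead-in. */
    static double idf(long docFreq, long numDocs) {
        return 1.0 + Math.log((double) numDocs / (double) (docFreq + 1));
    }

    /**
     * Computes one idf value for a term from per-shard statistics by
     * summing document frequencies and document counts first.
     */
    static double globalIdf(long[] docFreqPerShard, long[] numDocsPerShard) {
        if (docFreqPerShard.length != numDocsPerShard.length) {
            throw new IllegalArgumentException("arrays must have the same length");
        }
        long totalDocFreq = 0;
        long totalDocs = 0;
        for (int i = 0; i < docFreqPerShard.length; i++) {
            totalDocFreq += docFreqPerShard[i];
            totalDocs += numDocsPerShard[i];
        }
        return idf(totalDocFreq, totalDocs);
    }
}
```

A term that is rare in one index and common in another gets very different local idf values, so the same document would score differently depending on which index holds it. Computing idf from the summed `docFreq` and `numDocs` gives every searcher the same value, which is what a single merged index provides for free.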


    3 years ago (or was it 4?) we also used distributed indexing with
    searching being done on a single merged index. It worked surprisingly
    well, but it was replaced by indexing on a single machine, when we
    finally got around to doing a proper profiling of our indexing process
    and removed some serious bottlenecks.


  • Otis Gospodnetic at Jul 7, 2011 at 3:50 am
    We've used Hadoop MapReduce with Solr to parallelize indexing for a customer, and that brought their multi-hour indexing process down to a couple of minutes. There is/was also a Lucene-level contrib module in Hadoop that uses MapReduce to parallelize indexing.

    Otis

    ----
    Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
    Lucene ecosystem search :: http://search-lucene.com/


Discussion Overview
group: java-user
category: lucene
posted: Jun 30, '11 at 9:11a
active: Jul 7, '11 at 3:50a
posts: 6
users: 5
website: lucene.apache.org
