Hi,

I am working on a project where full-text search gets slower as the number of documents (or groups of documents) increases.

Here is a simplified description of the project: it is an email system, so each user has their own emails and can search them using Lucene.net.
So logically, it should be possible to implement it so that performance doesn't (really) drop as the number of users increases. The speed of a search should depend on the number of documents that the logged-in user has.

My current implementation is to have a property OwnerId in each document and use it as a clause in the searches, e.g.: OwnerId:123 AND MailContent:Something
However, this doesn't work...

The extreme solution would be to keep a completely separate index for each user.

Do you have any suggestions?

Pierre Henri.
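
For readers following the thread, here is a toy model of what a query such as OwnerId:123 AND MailContent:Something does against a single shared index. It is plain Python, not Lucene.Net code, and ToyIndex and its methods are hypothetical names made up for this sketch. The point it illustrates is only that the posting list for a word holds matches from every user, so the owner clause narrows the result but the word lookup still touches the shared index. Real Lucene uses skip lists and other optimizations, so this says nothing precise about actual cost.

```python
from collections import defaultdict


class ToyIndex:
    """A tiny inverted index: (field, term) -> list of doc ids in insertion order."""

    def __init__(self):
        self.postings = defaultdict(list)
        self.next_doc_id = 0

    def add(self, owner_id, mail_content):
        doc_id = self.next_doc_id
        self.next_doc_id += 1
        self.postings[("OwnerId", str(owner_id))].append(doc_id)
        # Index each distinct lowercased word once per document.
        for word in set(mail_content.lower().split()):
            self.postings[("MailContent", word)].append(doc_id)
        return doc_id

    def search_and(self, owner_id, word):
        """Doc ids matching OwnerId:<owner_id> AND MailContent:<word>."""
        owner_docs = self.postings.get(("OwnerId", str(owner_id)), [])
        word_docs = self.postings.get(("MailContent", word.lower()), [])
        return sorted(set(owner_docs) & set(word_docs))


index = ToyIndex()
index.add(123, "Something about the invoice")
index.add(456, "Something else entirely")
index.add(123, "Lunch on Friday")

hits = index.search_and(123, "Something")
```

Here the word "something" has a posting list with entries from both users, even though only one of them belongs to owner 123.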


  • Pavlo Zahozhenko at May 2, 2009 at 10:10 pm
    The simplest solution as I see it is "sharding" your index: for example,
    create one index for all users whose email address starts with the letter
    "a", another for users whose address starts with "b", and so on (you may
    assign a few rare letters to a single index to keep the indexes more or
    less evenly sized). Then, when a user performs a search, you do not
    search the whole MultiIndex, only the index containing that user's
    emails.
    If such sharding is not enough, you may partition your index by a different
    criterion, e.g. 1000 users per index, ordered by ID.

    As far as I know, this is common practice for large indices.

    Pavlo
    Zahozhenko.
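
The two partitioning schemes described in this reply could be sketched as routing functions like the ones below. This is only an illustration in plain Python, not Lucene.Net code. The shard names, the set of "rare" letters, and the function names are all assumptions made up for the example; the 1000-users-per-index figure comes from the post.

```python
import string

# Hypothetical grouping: users whose address starts with a rare letter share one shard.
RARE_LETTERS = set("qxyz")


def shard_by_first_letter(email):
    """Route a user to a shard based on the first character of the email address."""
    first = email.strip().lower()[:1]
    if first in RARE_LETTERS:
        return "index_rare"
    if first and first in string.ascii_lowercase:
        return "index_" + first
    # Digits, punctuation, non-ASCII letters, or an empty address.
    return "index_other"


def shard_by_user_id(user_id, users_per_shard=1000):
    """Route a user to a shard holding a fixed-size range of user IDs."""
    return "index_{:04d}".format(user_id // users_per_shard)
```

A search for a given user would then open only the shard returned by one of these functions, for example shard_by_user_id(2500) gives "index_0002". The ID-range variant tends to be easier to keep balanced, since first letters of addresses are not evenly distributed.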

  • Digy at May 3, 2009 at 10:28 am
    Could it be related to your code? Lucene.Net can handle very large
    indices easily.
    Have you tried the search speed improvement techniques? See
    http://wiki.apache.org/jakarta-lucene/ImproveSearchingSpeed

    > My current implementation is to have a property OwnerId in each document
    > and use it as a clause in the searches. Eg: OwnerId:123 AND
    > MailContent:Something
    > However, this doesn't work...
    I don't understand why this doesn't work.

    DIGY

    -----Original Message-----
    From: Pierre Henri Kuaté
    Sent: Saturday, May 02, 2009 11:02 PM
    To: lucene-net-user@incubator.apache.org
    Subject: Designing an index with constant speed no matter how big

  • Nitin Shiralkar at May 3, 2009 at 12:26 pm
    Hi Pierre,

    We have implemented our search engine in a similar fashion and it is working absolutely fine. A few questions:

    1. Do you sort on any field while searching? If yes, remove the sort and check again.
    2. How many results are retrieved while searching? If you are retrieving more than 100 documents, use a HitCollector.
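
As background for the second question: in Lucene versions of that era, a HitCollector receives one callback per matching document (a document number and a score), so results can be gathered in a single pass. The sketch below mimics that callback pattern in plain Python. It is not Lucene.Net code, and TopNCollector is a hypothetical class written for illustration; it keeps only the n best hits in a small heap.

```python
import heapq


class TopNCollector:
    """Keeps only the n best (score, doc_id) pairs seen so far."""

    def __init__(self, n):
        self.n = n
        self._heap = []  # min-heap; the weakest kept hit sits at index 0
        self.total_hits = 0

    def collect(self, doc_id, score):
        """Called once per matching document."""
        self.total_hits += 1
        if len(self._heap) < self.n:
            heapq.heappush(self._heap, (score, doc_id))
        elif (score, doc_id) > self._heap[0]:
            # The new hit beats the weakest kept hit, so swap it in.
            heapq.heapreplace(self._heap, (score, doc_id))

    def top_docs(self):
        """Return the kept hits as (doc_id, score), best score first."""
        return [(doc_id, score) for score, doc_id in sorted(self._heap, reverse=True)]


collector = TopNCollector(n=2)
for doc_id, score in [(7, 0.2), (3, 0.9), (5, 0.5), (9, 0.1)]:
    collector.collect(doc_id, score)

best = collector.top_docs()
```

Memory stays bounded by n no matter how many documents match, while total_hits still reports the full match count.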


  • Pierre Henri Kuaté at May 5, 2009 at 12:29 pm
    These are very useful suggestions; I will investigate all the tips in the Lucene wiki.

    Btw, when I said that "this doesn't work", referring to: OwnerId:123 AND MailContent:Something
    I meant that it was still very slow.

    My application doesn't sort using Lucene and generally retrieves fewer than 100 docs.

    I think the most promising solution is sharding...

    Thanks,
    Pierre Henri.


  • Moray McConnachie at May 5, 2009 at 12:58 pm
    If you are going to shard, and of course depending on the profile of the queries you expect to service, consider designing your shards around mail date. I read somewhere that for mailboxes 90% of the activity is within 10% of the mail items, the most recent mails being that 10%. This would be particularly attractive if you expect to service a considerable number of cross-user searches.

    You might be able to use as little as a couple of "hot" indices and an archive index.

    Yours,
    Moray
    -------------------------------------
    Moray McConnachie
    Head of IS +44 1865 261 600
    Oxford Analytica http://www.oxan.com
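
A date-based split along the lines suggested in this reply might look like the sketch below. It is plain Python rather than Lucene.Net code, and it simplifies the idea to one "hot" index plus one archive. The 90-day cutoff, the shard names, and the function names are assumptions chosen for the example, not values from the thread.

```python
from datetime import date, timedelta

HOT_DAYS = 90  # assumed cutoff between recent and archived mail


def shard_for_mail(mail_date, today):
    """Pick the index that a mail dated mail_date belongs to."""
    if today - mail_date <= timedelta(days=HOT_DAYS):
        return "hot"
    return "archive"


def shards_for_query(oldest_wanted, today):
    """Indexes a search must consult, given the oldest mail date of interest."""
    if shard_for_mail(oldest_wanted, today) == "hot":
        return ["hot"]
    return ["hot", "archive"]
```

With today set to date(2009, 5, 5), a search limited to mail since date(2009, 5, 1) would consult only the hot index. A real system would also need a periodic job that moves mail from the hot index to the archive as it ages past the cutoff.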


Discussion Overview
group: lucene-net-user
categories: lucene
posted: May 2, '09 at 8:03p
active: May 5, '09 at 12:58p
posts: 6
users: 5
website: lucene.apache.org
