Hi all,



I was reading one of the postings on concurrency and reread section 9.1 of Lucene in Action, which led me to this question. I have 100,000 customers, and I want to give each of them personal search over their own documents, sometimes including company documents in that search.

1. 100,000 customers with 10-20 small documents each.
2. 5,000 company documents: specifications, papers, research, etc.
3. Customers can search their own documents and the company documents.

P1: Do I provide an index for each customer and use multi-index searching to include the company documents when they need them?

OR

P2: Do I provide one large index for all 100,000 customers, adding a customer ID field so that each search can be constrained and customers can't search other customers' documents, and then categorize the company documents so customers can do multi-index searches into them?

After writing this out I realize that P2 is probably the wiser, less complicated choice, but I would like to hear from other Luceners.

Lucene in Action is one of the best written books in my library of ~300 CS books. It ranks in completeness and clarity up there with works by David Geary, Martin Fowler, and other Hatcher greats like Java Development with Ant.

Thanks Otis and Erik.

Regards, Lawrence

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


  • Paul Elschot at Mar 11, 2006 at 8:34 am


    In case you have many customers searching at the same time, compact filters
    can help reduce memory requirements:
    http://issues.apache.org/jira/browse/LUCENE-328
    A BitSet filter uses one bit per indexed document, while a compact filter
    uses one or three bytes per indexed document passing the filter.
    With 100 different customers searching their own docs at the same time,
    and 100,000 * 20 = 2,000,000 docs in the customer index:
    - BitSet filters will use 100 * (100,000 * 20) / 8 = 25,000,000 bytes,
    - compact filters will use roughly 100 * 20 * 2 = 4,000 bytes.
    The ratio between these is 100,000 / 16 = 6,250, or roughly 6000.
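
The arithmetic above can be checked with a few lines of Java. This is only a back-of-the-envelope sketch: the class and method names are made up for illustration, and the 2 bytes per passing document is the rough average assumed in the estimate above, not a measured figure.

```java
/** Back-of-the-envelope filter memory estimate; all names here are illustrative. */
final class FilterMemoryEstimate {

    /** A bit-set filter needs one bit for every document in the index, per filter. */
    static long bitSetFilterBytes(long concurrentFilters, long docsInIndex) {
        return concurrentFilters * docsInIndex / 8;
    }

    /** A compact filter only pays for documents that pass it (bytesPerDoc is an assumed average). */
    static long compactFilterBytes(long concurrentFilters, long docsPassingFilter, long bytesPerDoc) {
        return concurrentFilters * docsPassingFilter * bytesPerDoc;
    }

    public static void main(String[] args) {
        long customers = 100_000;
        long docsPerCustomer = 20;
        long concurrentFilters = 100;
        long docsInIndex = customers * docsPerCustomer; // 2,000,000 docs in the customer index

        long bitSet = bitSetFilterBytes(concurrentFilters, docsInIndex);
        long compact = compactFilterBytes(concurrentFilters, docsPerCustomer, 2);

        System.out.println("BitSet filters:  " + bitSet + " bytes");
        System.out.println("Compact filters: " + compact + " bytes");
        System.out.println("Ratio:           " + (bitSet / compact));
    }
}
```

The gap grows with the number of customers, because each BitSet filter is sized by the whole index while each compact filter is sized only by one customer's documents.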

    Since the company docs will not need to be filtered, you can put these in a
    separate index, and write your own MultiSearcher that filters only on the
    customer index.
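
To make the shape of that idea concrete, here is a minimal sketch in plain Java. It is a toy in-memory model and deliberately does not use the Lucene API: `Doc`, `ownerId`, and `search` are hypothetical names, and a substring match stands in for a real query. The only point it illustrates is that the customer-ID constraint is applied to the customer index, while the company index is searched unfiltered.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Toy in-memory model, NOT the Lucene API: all names here are hypothetical. */
final class TwoIndexSearchSketch {

    /** A document; ownerId is null for company documents. */
    static final class Doc {
        final String ownerId;
        final String text;

        Doc(String ownerId, String text) {
            this.ownerId = ownerId;
            this.text = text;
        }
    }

    /** Searches the customer index with a customer-ID filter and, optionally, the company index unfiltered. */
    static List<Doc> search(List<Doc> customerIndex, List<Doc> companyIndex,
                            String customerId, String term, boolean includeCompany) {
        List<Doc> hits = new ArrayList<>();
        for (Doc d : customerIndex) {
            // The filter applies here only: a customer never sees another customer's docs.
            if (customerId.equals(d.ownerId) && d.text.contains(term)) {
                hits.add(d);
            }
        }
        if (includeCompany) {
            for (Doc d : companyIndex) {
                // Company docs need no per-customer filter.
                if (d.text.contains(term)) {
                    hits.add(d);
                }
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        List<Doc> customerIndex = Arrays.asList(
                new Doc("c1", "invoice for lucene training"),
                new Doc("c2", "lucene support contract"));
        List<Doc> companyIndex = Arrays.asList(
                new Doc(null, "lucene product specification"));

        // Own documents only, then own documents plus company documents.
        System.out.println(search(customerIndex, companyIndex, "c1", "lucene", false).size());
        System.out.println(search(customerIndex, companyIndex, "c1", "lucene", true).size());
    }
}
```

In a real setup the two loops would be two searches whose hits still have to be merged and ranked together, which this sketch leaves out.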

    Regards,
    Paul Elschot

  • Chris Lu at Mar 11, 2006 at 5:10 pm
    I think it's best to have one small index for each customer, and one
    large index for the company's documents.

    Merging customers' content into the main index will cost a lot of
    resources and slow the system down, and it is not actually necessary.
    If indexing is done by a batch job, there will be a delay between the
    time content is updated and the time the index is refreshed. This may
    be acceptable in some cases, but users usually want to search their
    own content right away.

    With a small index per customer, indexing 10-20 small documents takes
    almost no time. Customers can search their content right after it is
    updated.

    Chris Lu
    -------------------------------
    Full-Text Search on Any Databases
    http://www.dbsight.net
  • Otis Gospodnetic at Mar 13, 2006 at 7:45 am
    Lawrence,

    Thanks for the LIA compliments.
    In addition to what Paul and Chris already mentioned, keep in mind open files (also covered in LIA). If you have 100K separate indices, that means a lot of open file descriptors. One common index doesn't have this problem. Separate indices are still possible; you just have to be smart about keeping track of used and unused indices and diligent about managing and freeing up resources.
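
One way to keep track of used and unused indices is a small least-recently-used cache that closes whatever it evicts. The following is a rough sketch under stated assumptions, not something from LIA or the Lucene API: `OpenIndexCache`, `maxOpen`, and `opener` are made-up names, and the handle type `H` stands in for whatever searcher or reader object the application keeps open per customer.

```java
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Function;

/** Hypothetical LRU cache that caps how many per-customer index handles stay open. */
final class OpenIndexCache<H extends AutoCloseable> {
    private final int maxOpen;
    private final Function<String, H> opener;
    // accessOrder = true: iteration runs from least to most recently used.
    private final LinkedHashMap<String, H> open = new LinkedHashMap<>(16, 0.75f, true);

    OpenIndexCache(int maxOpen, Function<String, H> opener) {
        this.maxOpen = maxOpen;
        this.opener = opener;
    }

    /** Returns the open handle for a customer, opening it (and evicting others) if needed. */
    synchronized H get(String customerId) {
        H handle = open.get(customerId);
        if (handle == null) {
            handle = opener.apply(customerId);
            open.put(customerId, handle);
            evictIfNeeded();
        }
        return handle;
    }

    synchronized int openCount() {
        return open.size();
    }

    /** Closes least-recently-used handles until at most maxOpen remain. */
    private void evictIfNeeded() {
        Iterator<Map.Entry<String, H>> it = open.entrySet().iterator();
        while (open.size() > maxOpen && it.hasNext()) {
            H eldest = it.next().getValue();
            it.remove();
            try {
                // Caveat: a real system must make sure no search is still using this handle.
                eldest.close();
            } catch (Exception e) {
                System.err.println("Failed to close index handle: " + e);
            }
        }
    }
}
```

With `maxOpen` set well below the process's file descriptor limit, only recently searched customers hold open files, at the cost of reopening an index when an idle customer comes back.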

    Otis


    ----- Original Message ----
    From: Lawrence <lucene@savant-is.com>
    To: java-user@lucene.apache.org
    Sent: Saturday, March 11, 2006 2:07:29 AM
    Subject: 100,000 indexes and what to do

  • John Powers at Mar 13, 2006 at 3:32 pm
    How does the information change in each of these customers' documents?
    I would think that if they were very dynamic, updates to the single
    index would not be great for you. But if the updates were just now and
    then, then given the performance of Lucene, the single index would be
    just fine. I was using a single index for most of my applications that
    use search until I ran into a commenting system for a blog-like
    application. I realized that I didn't want to be updating the main
    records index for each comment, so I split that off. I've been pretty
    happy with that choice so far. So, anyway, it seems like one index
    would be easier, and given the document count you are talking about, I
    am sure Lucene will handle either fine as far as speed and memory go.


Discussion Overview
group: java-user @
categories: lucene
posted: Mar 11, '06 at 7:07a
active: Mar 13, '06 at 3:32p
posts: 5
users: 5
website: lucene.apache.org
