on-the-fly "filters" from docID lists
Hello, we are trying to implement a query type for Lucene (with the eventual
target being Solr) where the query string passed in needs to be "filtered"
through a large list of document IDs per user. We can't store the user ID
information in the Lucene index per document, so we were planning to pull the
list of documents owned by user X from a key-value store at query time and
then build some sort of filter in memory before doing the Lucene/Solr query.
For example:

content:"cars" user_id:X567

would first pull the list of docIDs that user_id:X567 has "access" to from a
key-value store, and then we'd query the main index with content:"cars" but
only allow the docIDs that came back to be part of the response. The list of
docIDs can number in the hundreds of thousands.

What should I be looking at to implement such a feature?

Thank you
Martin


  • Michael McCandless at Jul 22, 2010 at 9:20 am
    It sounds like you should implement a custom Filter?

    Its getDocIdSet would consult your foreign key-value store and iterate
    through the allowed docIDs, per segment.

    Mike
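    A minimal sketch of such a Filter, against the Lucene 3.x API of the time, is
    below. KeyValueStore and its allowedDocIds method are assumptions standing in
    for whatever client the external store provides, and the sketch assumes the
    store can yield segment-relative Lucene docIDs; as the rest of the thread
    shows, producing those docIDs is the hard part.

        import java.io.IOException;
        import org.apache.lucene.index.IndexReader;
        import org.apache.lucene.search.DocIdSet;
        import org.apache.lucene.search.Filter;
        import org.apache.lucene.util.OpenBitSet;

        public class AllowedDocsFilter extends Filter {
          private final String userId;
          private final KeyValueStore store; // assumed external-store client

          public AllowedDocsFilter(String userId, KeyValueStore store) {
            this.userId = userId;
            this.store = store;
          }

          @Override
          public DocIdSet getDocIdSet(IndexReader reader) throws IOException {
            // Lucene calls this once per segment; docIDs are segment-relative.
            OpenBitSet bits = new OpenBitSet(reader.maxDoc());
            for (int docId : store.allowedDocIds(userId, reader)) {
              bits.set(docId);
            }
            return bits; // OpenBitSet is itself a DocIdSet
          }
        }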
  • Burton-West, Tom at Jul 22, 2010 at 8:38 pm
    Hi Mike and Martin,

    We have a similar use case. Is there a scalability/performance issue with getDocIdSet having to iterate through hundreds of thousands of docIDs?

    Tom Burton-West
    http://www.hathitrust.org/blogs/large-scale-search

  • Michael McCandless at Jul 23, 2010 at 12:56 am
    Well, Lucene can apply such a filter rather quickly; but your custom
    code first has to build it, so it's really a question of whether
    your custom code can build up / iterate the filter scalably.

    Mike
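    For reference, applying such a filter at search time is a single call in the
    3.x API; the cost lives in getDocIdSet. A hedged usage sketch, reusing the
    AllowedDocsFilter sketched earlier and Martin's example field names:

        // searcher is an open IndexSearcher; store is the assumed key-value client
        Query query = new TermQuery(new Term("content", "cars"));
        Filter perUser = new AllowedDocsFilter("X567", store);
        TopDocs hits = searcher.search(query, perUser, 10);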
  • Mark Harwood at Jul 23, 2010 at 6:56 am
    Re scalability of filter construction: the database is likely to hold stable primary keys, not Lucene doc IDs, which are unstable in the face of updates. You therefore need a quick way of converting the stable database keys read from the db into current Lucene doc IDs to create the filter. That could involve a lot of disk seeks unless you cache a PK->docID lookup in RAM. You should also use CachingWrapperFilter to cache the computed user permissions from one search to the next.
    This can get messy. If the access permissions are centred around roles/groups, it is normally faster to tag docs with these group names and query them with the list of roles the user holds.
    If individual user-doc-level perms are required, you could also consider dynamically looking up perms for just the top N results being shown, at the risk of needing to repeat the query with a larger N if insufficient matches pass the lookup.

    Cheers
    Mark
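    A hedged sketch of the caching suggestion, again reusing the assumed
    AllowedDocsFilter; userFilterCache is an application-level map, not a
    Lucene API:

        // One CachingWrapperFilter per user: the wrapped filter's DocIdSet is
        // computed once per (user, segment) and reused on later searches.
        ConcurrentHashMap<String, Filter> userFilterCache =
            new ConcurrentHashMap<String, Filter>();

        Filter filterFor(String userId) {
          Filter f = userFilterCache.get(userId);
          if (f == null) {
            f = new CachingWrapperFilter(new AllowedDocsFilter(userId, store));
            userFilterCache.put(userId, f);
          }
          return f;
        }

    The role/group alternative amounts to an extra required clause, e.g.
    +content:cars +acl:(editors managers), assuming roles are indexed in an
    acl field.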
  • Burton-West, Tom at Jul 23, 2010 at 5:04 pm
    Hi all,
    > Re scalability of filter construction: the database is likely to hold stable
    > primary keys, not Lucene doc IDs, which are unstable in the face of updates.

    This is the scalability issue I was concerned about. Assume the database call efficiently retrieves a sorted array of 50,000 stable primary keys. What is the best way to efficiently convert that list of primary keys to Lucene docIDs?

    I was looking at the Lucene in Action example code (which was not designed for this use case), where the Lucene docID is retrieved by iteratively calling termDocs.read. How expensive is this operation? Would 50,000 calls return in a few seconds or less?

    // declarations below are omitted in the book excerpt
    int[] docs = new int[1];
    int[] freqs = new int[1];
    OpenBitSet bits = new OpenBitSet(reader.maxDoc());
    for (String isbn : isbns) {
      if (isbn != null) {
        TermDocs termDocs = reader.termDocs(new Term("isbn", isbn));
        int count = termDocs.read(docs, freqs);
        if (count == 1) {
          bits.set(docs[0]);
        }
      }
    }
    > That could involve a lot of disk seeks unless you cache a PK->docID lookup in RAM.

    That sounds interesting. How would the PK->docID lookup get populated?
    Wouldn't a PK->docID cache be invalidated with each commit or merge?

    Tom

  • Mark Harwood at Jul 23, 2010 at 5:53 pm
    > What is the best way to efficiently convert that list of primary keys to
    > Lucene docIDs?

    Avoid disk seeks. Lucene is fast but still beholden to the laws of physics: random disk seeks will cost you, e.g., 50,000 * 5ms = 250 seconds (minus any effects of OS disk caching).
    The best way to handle this lookup is a PK->docID cache which can be reused for all users. Since 2.9, Lucene holds caches such as FieldCache down at the segment level, so a commit or merge should only invalidate a subset of cached items. The trouble is, I think FieldCache is for docID->field-value lookups, whereas you want a cache that works the other way around.

    Cheers
    Mark
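    A hedged sketch of such a reverse cache: build one PK->docID map per segment
    by walking every term of the primary-key field once (sequential I/O rather
    than one seek per key), and key the map by segment reader so a commit or
    merge only invalidates the affected segments. buildPkMap is illustrative,
    not a Lucene API; the TermEnum/TermDocs calls are the 3.x API.

        import java.io.IOException;
        import java.util.HashMap;
        import java.util.Map;
        import org.apache.lucene.index.IndexReader;
        import org.apache.lucene.index.Term;
        import org.apache.lucene.index.TermDocs;
        import org.apache.lucene.index.TermEnum;

        static Map<String, Integer> buildPkMap(IndexReader segmentReader, String pkField)
            throws IOException {
          Map<String, Integer> map = new HashMap<String, Integer>();
          TermEnum terms = segmentReader.terms(new Term(pkField, ""));
          TermDocs termDocs = segmentReader.termDocs();
          try {
            do {
              Term t = terms.term();
              if (t == null || !t.field().equals(pkField)) {
                break; // walked past the last term of the PK field
              }
              termDocs.seek(t);
              if (termDocs.next()) {
                map.put(t.text(), termDocs.doc()); // PKs are unique: one doc per term
              }
            } while (terms.next());
          } finally {
            terms.close();
            termDocs.close();
          }
          return map;
        }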
