Design guidance - search strategy
I have documents in Lucene with this simple schema, which I cannot change:
docid: (int)
contents: (text)

The user is shown a tree of 10,000 documents and selects which ones to
search; usually they select 5,000 or so.

I only want to search those 5,000 documents. All I have is their 'id' fields.

This is what I do now:

1. Get the 'Hits' for all documents.
2. Loop through all Hits looking for any 'docid' that is in the 5,000
   selected by the user.
3. Add the found docs to a collection and return that to the UI.


Is there a better way of doing this?

  • Erick Erickson at Dec 4, 2008 at 1:37 pm
    It's generally a bad idea to iterate over a Hits object. In fact, Hits
    is deprecated in recent versions of Lucene. The underlying
    problem is that the query is re-executed every 100 responses
    or so.

    First suggestion: create a Filter by iterating over your
    docid field and use it in your searches; see the
    Searcher.search variants that accept a Filter.

    Second suggestion: use one of the collector classes rather than
    Hits, e.g. TopDoc*, TopFieldDoc*, whichever suits.


    Best
    Erick
  • Ian Vink at Dec 4, 2008 at 9:21 pm
    So, let me get this straight. :)

    A Query tells Lucene what to search for. Then a Filter tells Lucene what?

    I think I'm missing what a Filter is for.

    Ian


  • Erick Erickson at Dec 4, 2008 at 10:13 pm
    See the Filter class in the docs or Lucene In Action for more
    detail, but here's the short form.

    A Filter is a bitset where each bit's ordinal position stands
    for a document (by Lucene's internal document number, not your
    docid field). I.e. bit 1 means document 1, bit 519 means
    document 519, etc.

    When you pass a Filter to one of the search routines that accept one,
    the search is restricted to ONLY those documents with the corresponding
    bit set in the Filter. Which sounds like just what you want, but I may be
    wrong.

    You construct a Filter by iterating over the terms you care about (see
    the TermDocs/TermEnum classes). In a nutshell, you find the terms you
    care about and, for each document that contains that term, set the
    bit in your filter. This is actually much faster than you might think.
    All this is provided for you in the two classes above; how you use them
    depends (tm). Watch that you don't run past the term you care about. I
    remember having that happen in one of those classes, but the details
    escape my aging memory.

    Hope that helps
    Erick
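
The bitset idea described above can be sketched without Lucene at all. The following is a minimal illustration in plain Java, not Lucene's own API: the class `DocIdFilterSketch`, the methods `buildFilter` and `restrict`, and the `Map` standing in for the index's term lookup are all hypothetical names chosen for this example. It assumes each docid term maps to exactly one internal document number.

```java
import java.util.Arrays;
import java.util.BitSet;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class DocIdFilterSketch {

    // Build a filter: one bit per internal document number, set for every
    // selected id that the (hypothetical) term lookup can resolve.
    static BitSet buildFilter(Map<String, Integer> termToDocNum,
                              List<Integer> selectedIds,
                              int maxDoc) {
        BitSet bits = new BitSet(maxDoc);
        for (int id : selectedIds) {
            Integer docNum = termToDocNum.get(Integer.toString(id));
            if (docNum != null) {
                bits.set(docNum);
            }
        }
        return bits;
    }

    // Restrict query matches to the filter: keep only documents whose bit
    // is set in both sets. The inputs are left unmodified.
    static BitSet restrict(BitSet queryMatches, BitSet filter) {
        BitSet result = (BitSet) queryMatches.clone();
        result.and(filter);
        return result;
    }

    public static void main(String[] args) {
        // Stand-in for the index: docid term -> internal document number.
        Map<String, Integer> index = new HashMap<>();
        index.put("101", 0);
        index.put("102", 1);
        index.put("103", 2);
        index.put("104", 3);

        // The user selected ids 101 and 103 (999 is not in the index).
        BitSet filter = buildFilter(index, Arrays.asList(101, 103, 999), 4);

        // Pretend the text query matched internal documents 1, 2 and 3.
        BitSet queryMatches = new BitSet(4);
        queryMatches.set(1);
        queryMatches.set(2);
        queryMatches.set(3);

        System.out.println(restrict(queryMatches, filter)); // prints {2}
    }
}
```

The point of the sketch is that the selection is resolved to document numbers once, up front, so the search itself never has to load and inspect each hit's docid.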
  • Ian Vink at Dec 5, 2008 at 2:03 am
    I bought your book :)
    Thanks, I will look into it.
  • Ian Vink at Dec 5, 2008 at 2:49 am
    It works.
    For those using Lucene.NET here is an example of a Filter that takes a list
    of IDs for books:


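    using System.Collections;            // BitArray
    using System.Collections.Generic;    // List<T>
    using Lucene.Net.Index;              // IndexReader, Term, TermDocs
    using Lucene.Net.Search;             // Filter

    public class BookFilter : Filter
    {
        private readonly List<int> bookIDs;

        public BookFilter(List<int> bookIDsToSearch)
        {
            bookIDs = bookIDsToSearch;
        }

        // Returns one bit per document in the index; a set bit means the
        // document may be returned by the search.
        public override BitArray Bits(IndexReader reader)
        {
            // Size the bit set from the index rather than a fixed 50000.
            BitArray bits = new BitArray(reader.MaxDoc());

            foreach (int bookID in bookIDs)
            {
                // Look up the document(s) whose "id" field holds this ID.
                TermDocs termDocs = reader.TermDocs(new Term("id", bookID.ToString()));
                try
                {
                    while (termDocs.Next())
                        bits.Set(termDocs.Doc(), true);
                }
                finally
                {
                    termDocs.Close();
                }
            }
            return bits;
        }
    }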
  • Erick Erickson at Dec 5, 2008 at 1:38 pm
    Glad it's working, but it's not my book, that's Erik Hatcher not
    Erick Erickson.....

    Erik:
    Do I get a commission?
  • Otis Gospodnetic at Dec 5, 2008 at 3:56 pm
    Yeah, I think we'll have to start paying the commission fee! ;)


    Otis
    --
    Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch


