FAQ
Hi!
I have this NotEmptyQuery class (http://gist.github.com/78115) which extends
the MultiTermQuery. The class is added into a BooleanQuery, after some other
queries (e.g. after TermQuery and LongTrieRangeFilter queries).
I wonder: does Lucene need to scan all the terms in the inverted index
and the collect all the document identifiers into the DocIdSet
in order to implement the MultiTermQuery which goes after some other
queries in a BooleanQuery? Like, collecting a thouthand of document
identifiers only to filter a few documents which remain after the other
queries had been fired up?
Or does Lucene use some trickery to only provide a subset of terms to the
MultiTermQuery?

Sorry if this is a dumb question or a part of a FAQ somewhere.
(I haven't found any related performance discussion of Lucene internals
on the site).


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Search Discussions

  • ArtemGr at Mar 13, 2009 at 4:34 pm

    I wonder: does Lucene need to scan all the terms in the inverted index
    and the collect all the document identifiers into the DocIdSet
    in order to implement the MultiTermQuery which goes after some other
    queries in a BooleanQuery? Like, collecting a thouthand of document
    identifiers only to filter a few documents which remain after the other
    queries had been fired up?
    Adding some debugging output shows that MultiTermQuery still scans all the
    terms even though the other parts of the enclosing BooleanQuery
    had already diminished the results to one or two documents.

    But then, the MultiTermQuery itself looks to be implemented as a filter.
    (I first created a filter, but then decided to make a query
    from the MultiTermQuery in hopes that it would be faster due to some
    query combination technology, but it turns out there is no difference,
    since MultiTermQuery itself only implements a filter).
    I do not understand how MultiTermQuery works with BooleanQuery.
    BooleanWeight asks all the underlying queries for their weights,
    but MultiTermQuery does not implement the createWeight method and should
    throw an UnsupportedOperationException...

    I think there is a space for optimization here:
    IndexReader is basically a "SortedMap<Term,List<Doc>> reader".
    Now, some of the BooleanQuery clauses will diminish that SortedMap.
    Currently they return a "List<(Doc,Float)> scores", not the filtered
    SortedMap<Term,List<Doc>>, but the optimization is still possible
    by dynamically filtering a subset of SortedMap<Term,List<Doc>> terms,
    only returning from it the entries which does still have a passing score.
    E.g., moving the "merge" operation on List<Doc> sets earlier, so that
    the subsequent queries does not even see the documents which were already
    excluded by the previous BooleanQuery clauses.
    It is quite possible that the previous BooleanQuery clause will remove a lot
    of documents from the consideration, so that some terms in the IndexReader
    will have an empty set of documents to consider by the subsequent
    BooleanQuery clauses. A filtered IndexReader won't even pass such terms
    into the subsequent clauses (e.g. queries).
    Thus, in the best case, instead of considering tens of thouthands of unique
    terms and documents, a subsequent clause in a BooleanQuery will only have to
    consider a few.


    ---------------------------------------------------------------------
    To unsubscribe, e-mail: [email protected]
    For additional commands, e-mail: [email protected]
  • Michael McCandless at Mar 15, 2009 at 10:07 pm

    ArtemGr wrote:

    I wonder: does Lucene need to scan all the terms in the inverted
    index
    and the collect all the document identifiers into the DocIdSet
    in order to implement the MultiTermQuery which goes after some other
    queries in a BooleanQuery? Like, collecting a thouthand of document
    identifiers only to filter a few documents which remain after the
    other
    queries had been fired up?
    Adding some debugging output shows that MultiTermQuery still scans
    all the
    terms even though the other parts of the enclosing BooleanQuery
    had already diminished the results to one or two documents.
    By default, MultiTermQuery rewrites itself to a BooleanQuery OR'ing
    together
    BooleanClauses for each term in the range.

    So, yes, MultiTermQuery enumerates all terms in the range up front,
    but you should not see it visiting all the docs for those terms
    when another clause in the toplevel BooleanQuery is very restrictive.

    If you use a constant-score query (ConstantScoreRangeQuery in 2.4,
    or on trunk many queries can now do constant-score mode), then
    unfortunately it will enumerate all docs for all terms, up front (well,
    each time it advances to another segment).
    But then, the MultiTermQuery itself looks to be implemented as a
    filter.
    (I first created a filter, but then decided to make a query
    from the MultiTermQuery in hopes that it would be faster due to some
    query combination technology, but it turns out there is no difference,
    since MultiTermQuery itself only implements a filter).
    I do not understand how MultiTermQuery works with BooleanQuery.
    BooleanWeight asks all the underlying queries for their weights,
    but MultiTermQuery does not implement the createWeight method and
    should
    throw an UnsupportedOperationException...
    MultiTermQuery gets rewritten to a new Query impl that implements
    createWeight. (It is rather confusing to "follow" how exactly a query
    is searched).
    I think there is a space for optimization here:
    IndexReader is basically a "SortedMap<Term,List<Doc>> reader".
    Now, some of the BooleanQuery clauses will diminish that SortedMap.
    Currently they return a "List<(Doc,Float)> scores", not the filtered
    SortedMap<Term,List<Doc>>, but the optimization is still possible
    by dynamically filtering a subset of SortedMap<Term,List<Doc>> terms,
    only returning from it the entries which does still have a passing
    score.
    E.g., moving the "merge" operation on List<Doc> sets earlier, so that
    the subsequent queries does not even see the documents which were
    already
    excluded by the previous BooleanQuery clauses.
    It is quite possible that the previous BooleanQuery clause will
    remove a lot
    of documents from the consideration, so that some terms in the
    IndexReader
    will have an empty set of documents to consider by the subsequent
    BooleanQuery clauses. A filtered IndexReader won't even pass such
    terms
    into the subsequent clauses (e.g. queries).
    Thus, in the best case, instead of considering tens of thouthands of
    unique
    terms and documents, a subsequent clause in a BooleanQuery will only
    have to
    consider a few.
    There is a heuristic in ConjunctionScorer that tries to order the
    clauses
    so that we visit the most restrictive ones first. It uses the first
    docID for
    all clauses to predict sparsness. (It would be better to add an
    approxCount()
    to scorers, but likely this heuristic works well in practice).

    Mike

    ---------------------------------------------------------------------
    To unsubscribe, e-mail: [email protected]
    For additional commands, e-mail: [email protected]

Related Discussions

Discussion Navigation
viewthread | post
Discussion Overview
groupjava-user @
categorieslucene
postedMar 12, '09 at 3:35p
activeMar 15, '09 at 10:07p
posts3
users2
websitelucene.apache.org

2 users in discussion

ArtemGr: 2 posts Michael McCandless: 1 post

People

Translate

site design / logo © 2023 Grokbase