I wonder: does Lucene need to scan all the terms in the inverted index
and then collect all the document identifiers into the DocIdSet
in order to evaluate a MultiTermQuery that comes after some other
queries in a BooleanQuery? That is, does it collect a thousand document
identifiers only to filter the few documents that remain after the other
queries have run?
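
To make the concern concrete, here is a minimal sketch of the behaviour I am asking about. It is a toy model in plain Java, not Lucene code: the index is a SortedMap from term to document ids, and the names EagerMultiTermScan, toyIndex and collectPrefix are made up for this illustration.

```java
import java.util.BitSet;
import java.util.List;
import java.util.SortedMap;
import java.util.TreeMap;

// Toy model only: none of these names come from Lucene.
final class EagerMultiTermScan {

    // Hypothetical stand-in for the inverted index: term -> document ids.
    static SortedMap<String, List<Integer>> toyIndex() {
        SortedMap<String, List<Integer>> index = new TreeMap<>();
        index.put("apple", List.of(1, 4, 7));
        index.put("apply", List.of(2, 4));
        index.put("apricot", List.of(3, 9));
        index.put("banana", List.of(4, 5));
        return index;
    }

    // Visits every term that starts with the prefix and ORs all of its
    // document ids into one bit set, regardless of what other clauses matched.
    static BitSet collectPrefix(SortedMap<String, List<Integer>> index, String prefix) {
        BitSet docs = new BitSet();
        // All keys in [prefix, prefix + '\uffff') share the prefix.
        SortedMap<String, List<Integer>> range =
                index.subMap(prefix, prefix + Character.MAX_VALUE);
        for (List<Integer> postings : range.values()) {
            for (int doc : postings) {
                docs.set(doc);
            }
        }
        return docs;
    }

    public static void main(String[] args) {
        // The multi-term clause touches seven postings across three terms...
        BitSet multiTerm = collectPrefix(toyIndex(), "ap");

        // ...while the other clauses had already narrowed the result to doc 4.
        BitSet otherClauses = new BitSet();
        otherClauses.set(4);

        multiTerm.and(otherClauses);
        System.out.println(multiTerm);
    }
}
```

In this model the prefix clause collects six distinct documents only for the intersection to keep one of them, which is the kind of wasted work I suspect happens in the real case.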
Adding some debugging output shows that MultiTermQuery still scans all the
terms even though the other parts of the enclosing BooleanQuery
have already narrowed the results down to one or two documents.
But then, the MultiTermQuery itself appears to be implemented as a filter.
(I first created a filter, but then decided to make a query
from the MultiTermQuery in the hope that it would be faster thanks to some
query-combination optimization, but it turns out there is no difference,
since MultiTermQuery itself only implements a filter.)
I do not understand how MultiTermQuery works with BooleanQuery.
BooleanWeight asks all the underlying queries for their weights,
but MultiTermQuery does not implement the createWeight method and should
throw an UnsupportedOperationException...
I think there is room for optimization here:
IndexReader is basically a "SortedMap<Term,List<Doc>> reader".
Now, some of the BooleanQuery clauses will diminish that SortedMap.
Currently they return a "List<(Doc,Float)> scores", not the filtered
SortedMap<Term,List<Doc>>, but the optimization is still possible
by dynamically filtering a subset of the SortedMap<Term,List<Doc>> terms,
returning from it only the entries that still have a passing score.
That is, by moving the "merge" operation on the List<Doc> sets earlier, so that
the subsequent queries do not even see the documents which were already
excluded by the previous BooleanQuery clauses.
It is quite possible that a previous BooleanQuery clause will remove a lot
of documents from consideration, so that some terms in the IndexReader
will have an empty set of documents left for the subsequent
BooleanQuery clauses to consider. A filtered IndexReader would not even pass
such terms on to the subsequent clauses (e.g. queries).
Thus, in the best case, instead of considering tens of thousands of unique
terms and documents, a subsequent clause in a BooleanQuery would only have to
consider a few.
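
The filtered view I have in mind could be sketched like this. Again, this is a toy model in plain Java rather than Lucene code; FilteredTermView and restrictTo are hypothetical names, and the index is the same kind of SortedMap from term to document ids as described above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.SortedMap;
import java.util.TreeMap;

// Toy model only: none of these names come from Lucene.
final class FilteredTermView {

    // Returns a copy of the index restricted to the surviving documents.
    // Terms whose postings become empty are dropped entirely, so a later
    // clause never sees them.
    static SortedMap<String, List<Integer>> restrictTo(
            SortedMap<String, List<Integer>> index, Set<Integer> survivors) {
        SortedMap<String, List<Integer>> filtered = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> entry : index.entrySet()) {
            List<Integer> kept = new ArrayList<>();
            for (Integer doc : entry.getValue()) {
                if (survivors.contains(doc)) {
                    kept.add(doc);
                }
            }
            if (!kept.isEmpty()) {
                filtered.put(entry.getKey(), kept);
            }
        }
        return filtered;
    }

    public static void main(String[] args) {
        SortedMap<String, List<Integer>> index = new TreeMap<>();
        index.put("apple", List.of(1, 4, 7));
        index.put("apply", List.of(2, 4));
        index.put("apricot", List.of(3, 9));
        index.put("banana", List.of(4, 5));

        // Suppose the earlier clauses left only document 4.
        System.out.println(restrictTo(index, Set.of(4)));
    }
}
```

Note that this eager copy still walks every posting once, so it only shows what the subsequent clauses would see. To actually save work, the filtering would have to be done lazily, or by looking up the terms of the surviving documents directly, and I do not know whether the index layout allows that cheaply.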