FAQ

Efficient Query Evaluation using a Two-Level Retrieval Process

J. Delgado
Nov 16, 2009 at 6:10 pm

On Mon, Nov 16, 2009 at 9:44 AM, Earwin Burrfoot wrote:
This algo is strictly tied to sort-by-score, if I understand it correctly.
Lucene has queries and sorting decoupled (except for allowOutOfOrder
mess), so implementing it would require some really fat hacks.
According to the paper on Indexing Boolean Expression (using the WAND
algo), sorting can be done based on scores that are determined based
weight assignment to key-value pairs:

http://ilpubs.stanford.edu:8090/927/2/wand_vldb.pdf

So I believe this can be generalized to sorting by any doc attributes
given the proper weight assignment model

Of course, the devil-is-in-the-details :-(

-- Joaquin

On Mon, Nov 16, 2009 at 20:26, J. Delgado wrote:
As I understood it setMinimumNumberShouldMatch(int min) Is used to
specify a minimum number of the optional BooleanClauses which must be
satisfied.

I haven't seen the implementation of setMinimumNumberShouldMatch but
it seems a bit different than what is intended with the WAND operator,
which can take any real number as threshold θ

As stated in the paper:

WAND(X1,w1, . . . Xk,wk, θ) is true iff X 1≤i≤k and SUM(xiwi)≥ θ

where xi is the indicator variable for Xi, that is xi =  1, if Xi is
true 0, otherwise.

Observe that WAND can be used to implement AND
and OR via
AND(X1,X2, . . .Xk) ≡ WAND(X1, 1,X2, 1, . . . Xk, 1, k),
and
OR(X1,X2, . ..Xk) ≡ WAND(X1, 1,X2, 1, . ..Xk, 1, 1).

What I find interesting is the idea of using a first pass using the
upper bound (maximal) contribution of a term on any document score and
the dynamic setting of the threshold θ to skip or to fully evaluate a
document..

As stated in the paper:

"Given this setup our preliminary scoring consists of evaluating
for each document d
WAND(X1,UB1,X2,UB2, . . .,Xk,UBk, θ),
where Xi is an indicator variable for the presence of query term i in
document d and the threshold θ is varied during
the algorithm as explained below. If WAND evaluates to true, then the
document d undergoes a full evaluation.
The threshold θ is set dynamically by the algorithm based on the
minimum score m among the top n results found so
far, where n is the number of requested documents. The larger the
threshold, the more documents will be skipped
and thus we will need to compute full scores for fewer documents."

I think its worth a try...

-- Joaquin
On Mon, Nov 16, 2009 at 2:54 AM, Andrzej Bialecki wrote:

J. Delgado wrote:
Here is the link to the paper.
http://cis.poly.edu/westlab/papers/cntdstrb/p426-broder.pdf

A more recent application of the use and extension of the WAND operator for
indexing of Boolean expressions:
http://ilpubs.stanford.edu:8090/927/2/wand_vldb.pdf

-- Joaquin


On Sun, Nov 15, 2009 at 11:12 PM, Shalin Shekhar Mangar <
shalinmangar@gmail.com> wrote:
Hey Joaquin,

The mailing list strips off attachments. Can you please upload it somewhere
and give us the link?

On Mon, Nov 16, 2009 at 12:35 PM, J. Delgado <joaquin.delgado@gmail.com
wrote:
Please find attached the paper on "Efficient Query Evaluation using a
Two-Level Retrieval Process". I believe that such approach may improve the
way Lucene/Solr evaluates queries today.
The functionality of WAND (weak AND) is already implemented in Lucene, if I understand it correctly - this is the BooleanQuery.setMinShouldMatch(int). Lucene implements this probably differently from the algorithm described in the paper, so there may be still some benefits from comparing the algorithms in Lucene's BooleanScorer[2] with this one ...


--
Best regards,
Andrzej Bialecki     <><
___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com
---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org


--
Kirill Zakharenko/Кирилл Захаренко (earwin@gmail.com)
Home / Mobile: +7 (495) 683-567-4 / +7 (903) 5-888-423
ICQ: 104465785

---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org
reply

Search Discussions

Discussion Posts

Previous

Follow ups

Related Discussions