I added some timing logging to IndexSearcher and ScaleFloatFunction and
compared a simple DisMax query with a DisMax query wrapped in the scale
function. The index size was 500K docs, of which 61K match the DisMax query.
The simple DisMax query took 33 ms, the function query took 89 ms. What I
found was:

1. The scale query only normalized the scores once (in
ScaleInfo.createScaleInfo) and added 33 ms to the Qtime. Subsequent calls
to ScaleFloatFunction.getValues bypassed createScaleInfo and added ~0 time.

2. The FunctionQuery 'nextDoc' iterations added 16 ms over the DisMax
'nextDoc' iterations.

Here's the breakdown:

Simple DisMax query:
weight.scorer: 3 ms (get term enum)
scorer.score: 23 ms (nextDoc iterations)
other: 3 ms
Total: 33 ms

DisMax wrapped in ScaleFloatFunction:
weight.scorer: 39 ms (get scaled values)
scorer.score: 39 ms (nextDoc iterations)
other: 11 ms
Total: 89 ms

Even with any improvements to 'scale', all function queries will add a
linear increase to the Qtime as index size increases, since they match all
docs.

Trey: I'd be happy to test any patch that you find improves the speed.


On Mon, Dec 2, 2013 at 11:21 PM, Trey Grainger wrote:

We're working on the same problem with the scale(query(...)) combination,
so I'd like to share a bit more information that may be useful.

*On the scale function:*
Even though the scale query has to calculate the scores for all documents,
it is actually doing this work twice for each ValueSource (once to
calculate the min and max values, and then again when actually scoring the
documents), which is inefficient.

To solve the problem, we're in the process of putting a cache inside the
scale function to remember the values for each document when they are
initially computed (to find the min and max) so that the second pass can
just use the previously computed values for each document. Our theory is
that most of the extra time due to the scale function is really just the
result of doing duplicate work.

No promises this won't be overly costly in terms of memory utilization, but
we'll see what we get in terms of speed improvements and will share the
code if it works out well. Alternate implementation suggestions (or
criticism of a cache like this) are also welcomed.
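
The caching idea described above can be sketched roughly as follows. This is a minimal, self-contained illustration and not the actual patch or Lucene's ScaleFloatFunction: the class name ScaleCacheSketch, the method scaleWithCache, and the use of a plain float[] as the cache are all assumptions made for the example.

```java
import java.util.Arrays;
import java.util.function.IntToDoubleFunction;

/** Hypothetical sketch; not Lucene's ScaleFloatFunction. */
final class ScaleCacheSketch {

    /**
     * Scales the source values of docs 0..maxDoc-1 into [minTarget, maxTarget].
     * Pass 1 computes each source value exactly once, caches it, and tracks
     * the min and max. Pass 2 reads only from the cache.
     */
    static float[] scaleWithCache(IntToDoubleFunction source, int maxDoc,
                                  float minTarget, float maxTarget) {
        float[] cached = new float[maxDoc]; // assumed cache: one float per doc
        float min = Float.POSITIVE_INFINITY;
        float max = Float.NEGATIVE_INFINITY;
        for (int doc = 0; doc < maxDoc; doc++) {
            float v = (float) source.applyAsDouble(doc); // the expensive call
            cached[doc] = v;
            min = Math.min(min, v);
            max = Math.max(max, v);
        }
        float range = max - min;
        // Avoid dividing by zero when every document has the same value.
        float factor = (range == 0f) ? 0f : (maxTarget - minTarget) / range;
        float[] scaled = new float[maxDoc];
        for (int doc = 0; doc < maxDoc; doc++) {
            scaled[doc] = (cached[doc] - min) * factor + minTarget; // no recompute
        }
        return scaled;
    }

    public static void main(String[] args) {
        float[] raw = {2.0f, 4.0f, 6.0f};
        int[] calls = {0}; // counts how often the source is evaluated
        float[] scaled = scaleWithCache(doc -> { calls[0]++; return raw[doc]; },
                                        raw.length, 0f, 1f);
        System.out.println(Arrays.toString(scaled) + " source calls: " + calls[0]);
    }
}
```

With a dense float[] like this, the memory cost is four bytes per document, so roughly 2 MB per scaled source on a 500K-document index. Whether that is acceptable is exactly the open question raised above.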


*On the NoOp product function: scale(prod(1, query(...))):*
We do the same thing, which ultimately is just an unnecessary waste of a
loop through all documents to do an extra multiplication step. I just
debugged the code and uncovered the problem. There is a Map (called
context) that is passed through to each value source to store intermediate
state, and both the query and scale functions are passing the ValueSource
for the query function in as the KEY to this Map (as opposed to using some
composite key that makes sense in the current context). Essentially, these
lines are overwriting each other:

Inside ScaleFloatFunction:
  context.put(this.source, scaleInfo);
  // this.source refers to the QueryValueSource; scaleInfo is a ScaleInfo object
Inside QueryValueSource:
  context.put(this, w);
  // this refers to the same QueryValueSource as above; w is a Weight object

As such, when the ScaleFloatFunction later goes to read the ScaleInfo from
the context Map, it unexpectedly pulls the Weight object out instead and
thus the invalid cast exception occurs. The NoOp multiplication works
because it puts a "different" ValueSource between the query and the
ScaleFloatFunction such that this.source (in ScaleFloatFunction) != this
(in QueryValueSource).

This should be an easy fix. I'll create a JIRA ticket to use better key
names in these functions and push up a patch. This will eliminate the need
for the extra NoOp function.
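
To make the collision concrete, here is a small stand-alone sketch. It uses no Lucene classes: FakeWeight, FakeScaleInfo and OwnerKey are hypothetical stand-ins, and the composite key shown is only one possible way to separate the entries, not necessarily what the eventual patch will do.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Objects;

/** Hypothetical sketch of the context-Map key collision; not Lucene code. */
final class ContextKeySketch {

    static final class FakeWeight {}     // stand-in for a Weight
    static final class FakeScaleInfo {}  // stand-in for a ScaleInfo

    /** Assumed composite key: the function that owns the entry plus its source. */
    static final class OwnerKey {
        final Object owner;
        final Object source;
        OwnerKey(Object owner, Object source) {
            this.owner = owner;
            this.source = source;
        }
        @Override public boolean equals(Object o) {
            if (!(o instanceof OwnerKey)) return false;
            OwnerKey k = (OwnerKey) o;
            return Objects.equals(owner, k.owner) && Objects.equals(source, k.source);
        }
        @Override public int hashCode() {
            return Objects.hash(owner, source);
        }
    }

    /** Both functions use the query source itself as the key: second put wins. */
    static Object readWithSharedKey() {
        Map<Object, Object> context = new HashMap<>();
        Object querySource = new Object();
        context.put(querySource, new FakeScaleInfo()); // scale function's entry
        context.put(querySource, new FakeWeight());    // query function overwrites it
        return context.get(querySource);               // a FakeWeight comes back
    }

    /** The scale function stores its entry under a composite key instead. */
    static Object readWithCompositeKey() {
        Map<Object, Object> context = new HashMap<>();
        Object querySource = new Object();
        Object scaleFunction = new Object();
        context.put(new OwnerKey(scaleFunction, querySource), new FakeScaleInfo());
        context.put(querySource, new FakeWeight());    // no longer collides
        return context.get(new OwnerKey(scaleFunction, querySource));
    }

    public static void main(String[] args) {
        try {
            FakeScaleInfo info = (FakeScaleInfo) readWithSharedKey();
            System.out.println("unexpectedly got " + info);
        } catch (ClassCastException e) {
            System.out.println("shared key: ClassCastException");
        }
        FakeScaleInfo ok = (FakeScaleInfo) readWithCompositeKey();
        System.out.println("composite key: got " + ok.getClass().getSimpleName());
    }
}
```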

-Trey


On Mon, Dec 2, 2013 at 12:41 PM, Peter Keegan <peterlkeegan@gmail.com> wrote:
I'm pursuing this possible PostFilter solution. I can see how to collect
all the hits and recompute the scores in a PostFilter, after all the hits
have been collected (for scaling). What I can't see is how to get the custom
doc/score values back into the main query's HitQueue. Any advice?

Thanks,
Peter


On Fri, Nov 29, 2013 at 9:18 AM, Peter Keegan <peterlkeegan@gmail.com> wrote:
Instead of using a function query, could I use the edismax query (plus
some low cost filters not shown in the example) and implement the
scale/sum/product computation in a PostFilter? Is the query's maxScore
available there?

Thanks,
Peter


On Wed, Nov 27, 2013 at 1:58 PM, Peter Keegan <peterlkeegan@gmail.com> wrote:
Although the 'scale' is a big part of it, here's a closer breakdown. Here
are 4 queries with increasing functions, and their response times (caching
turned off in solrconfig):

100 msec:
select?q={!edismax v='news' qf='title^2 body'}

135 msec:
select?qq={!edismax v='news' qf='title^2 body'}&q={!func}product(field(myfield),query($qq))&fq={!query v=$qq}

200 msec:
select?qq={!edismax v='news' qf='title^2 body'}&q={!func}sum(product(0.75,query($qq)),product(0.25,field(myfield)))&fq={!query v=$qq}

320 msec:
select?qq={!edismax v='news' qf='title^2 body'}&scaledQ=scale(product(query($qq),1),0,1)&q={!func}sum(product(0.75,$scaledQ),product(0.25,field(myfield)))&fq={!query v=$qq}

Btw, that no-op product is necessary, else you get this exception:

org.apache.lucene.search.BooleanQuery$BooleanWeight cannot be cast to
org.apache.lucene.queries.function.valuesource.ScaleFloatFunction$ScaleInfo

thanks,
peter



On Wed, Nov 27, 2013 at 1:30 PM, Chris Hostetter <
hossman_lucene@fucit.org> wrote:
: So, this query does just what I want, but it's typically 3 times slower
: than the edismax query without the functions:

that's because the scale() function is inherently slow (it has to
compute the min & max value for every document in order to know how to
scale them)
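
For reference, a linear min-max scaling of this kind is usually written as below. This is a sketch of the general formula, not a quote from the Solr source; the symbols s_d (raw score of document d) and a, b (the target range, 0 and 1 in the query above) are assumed here for illustration:

```latex
\mathrm{scaled}(d) = a + (b - a)\,\frac{s_d - \min_{j} s_j}{\max_{j} s_j - \min_{j} s_j},
\qquad \max_{j} s_j \neq \min_{j} s_j
```

Both the minimum and the maximum range over every document j, so under this formula nothing can be scaled until all of the raw values are known.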

what you are seeing is the price you have to pay to get that query with a
"normalized" 0-1 value.

(you might be able to save a little bit of time by eliminating that
no-Op multiply by 1: "product(query($qq),1)" ... but i doubt you'll even
notice much of a change given that scale function.)

: Is there any way to speed this up? Would writing a custom function query
: that compiled all the function queries together be any faster?

If you can find a faster implementation for scale() then by all means let
us know, and we can fold it back into Solr.


-Hoss

Discussion Overview
group: solr-user @
categories: lucene
posted: Nov 7, '13 at 1:56p
active: Jan 6, '14 at 6:23p
posts: 19
users: 6
website: lucene.apache.org...
