Greetings,

I've been digging into this for two days now and have come up short -
hopefully there is some simple answer I am just not seeing:

I have a solr 1.4.1 instance and a solr 3.6.0 instance, both configured as
identically as possible (given deprecations) and indexing the same document.

For most queries the results are very close (scores agreeing to about three
significant digits, almost identical positions in the results).

However, for certain documents the scores are very different, causing
these docs to be ranked 25 or more positions apart in the results.

In looking at debugQuery output, it seems like this is due to fieldNorm
values being lower for the 3.6.0 instance than the 1.4.1.

(note that for most docs, the fieldNorms are identical)

I have taken the field values for the example below and run them
through /admin/analysis.jsp on each solr instance. Even for the problematic
docs/fields, the results are almost identical. For the example below, the
t_tag values for the problematic doc:
1.4.1: 162 values
3.6.0: 164 values

Note that 1/sqrt(162) = 0.07857, which approximately equals the fieldNorm
for 1.4.1; however, inverting the 3.6.0 fieldNorm gives an implied length of
(1/0.0625)^2 = 256, which is nowhere near 164.

Here is a particular example from 1.4.1:
1.6263733 = (MATCH) fieldWeight(t_tag:soul in 2066419), product of:
3.8729835 = tf(termFreq(t_tag:soul)=15)
5.3750753 = idf(docFreq=27619, maxDocs=2194294)
0.078125 = fieldNorm(field=t_tag, doc=2066419)

And the same from 3.6.0:
1.3042576 = (MATCH) fieldWeight(t_tag:soul in 1977957), product of:
3.8729835 = tf(termFreq(t_tag:soul)=15)
5.388126 = idf(docFreq=27740, maxDocs=2232857)
0.0625 = fieldNorm(field=t_tag, doc=1977957)
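For what it's worth, the explain lines above multiply out as printed, and nearly the entire score gap comes from the fieldNorm bucket rather than idf. A quick standalone check of that arithmetic (plain Java, no Lucene dependency; the constants are copied from the explain output above):

```java
// Check that the debugQuery explains above multiply out, and that the
// 1.4.1-vs-3.6.0 gap is dominated by the fieldNorm ratio, not idf.
public class ExplainCheck {
    public static void main(String[] args) {
        float tf = (float) Math.sqrt(15);             // tf(termFreq=15) = sqrt(15) = 3.8729835

        float score141 = tf * 5.3750753f * 0.078125f; // ~1.6263 (1.4.1 explain)
        float score360 = tf * 5.388126f * 0.0625f;    // ~1.3043 (3.6.0 explain)
        System.out.println(score141);
        System.out.println(score360);

        System.out.println(0.078125f / 0.0625f);      // 1.25: the norm-bucket swing
        System.out.println(5.388126f / 5.3750753f);   // ~1.002: idf barely moved
    }
}
```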


Here is the 1.4.1 config for the t_tag field and text type:
<fieldtype name="text" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.StandardFilterFactory"/>
    <filter class="solr.ISOLatin1AccentFilterFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.StopFilterFactory" words="stopwords.txt" ignoreCase="true"/>
    <filter class="solr.EnglishPorterFilterFactory" protected="protwords.txt"/>
  </analyzer>
</fieldtype>
<dynamicField name="t_*" type="text" indexed="true" stored="true"
    required="false" multiValued="true" termVectors="true"/>


And 3.6.0 schema config for the t_tag field and text type:
<fieldtype name="text" class="solr.TextField" positionIncrementGap="100"
    autoGeneratePhraseQueries="true">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.StandardFilterFactory"/>
    <filter class="solr.ASCIIFoldingFilterFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.StopFilterFactory" words="stopwords.txt" ignoreCase="true"/>
    <filter class="solr.KeywordMarkerFilterFactory" protected="protwords.txt"/>
    <filter class="solr.PorterStemFilterFactory"/>
  </analyzer>
</fieldtype>
<field name="t_tag" type="text" indexed="true" stored="true"
    required="false" multiValued="true"/>

I at first got distracted by this change between versions:
LUCENE-2286: Enabled DefaultSimilarity.setDiscountOverlaps by default. This
means that terms with a position increment gap of zero do not affect the
norms calculation by default.
However, this doesn't appear to be causing the issue as, according to
analysis.jsp there is no overlap for t_tag...
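To double-check that reasoning: LUCENE-2286 makes the norm calculation subtract the number of zero-position-increment tokens from the field length, so with zero overlap the old and new length calculations agree and this change cannot move the norm. A tiny illustration (the variable names here are illustrative, not the Lucene API):

```java
// Illustration of LUCENE-2286 (discountOverlaps): tokens at position
// increment 0 (e.g. injected synonyms) stop counting toward field length.
// With zero overlap, pre- and post-change norms are identical.
public class OverlapCheck {
    public static void main(String[] args) {
        int numTokens = 164;  // t_tag token count reported by 3.6.0 analysis.jsp
        int numOverlap = 0;   // analysis.jsp showed no zero-increment tokens

        double normOld = 1.0 / Math.sqrt(numTokens);               // pre-LUCENE-2286
        double normNew = 1.0 / Math.sqrt(numTokens - numOverlap);  // discountOverlaps=true
        System.out.println(normOld == normNew);  // true: overlaps can't explain the gap
    }
}
```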

Can you point me to where these fieldNorm differences are coming from, and
why they'd only be happening for a select few documents whose content
doesn't stand out?

Thank you,
Aaron


  • Robert Muir at Jul 19, 2012 at 1:44 pm

    On Thu, Jul 19, 2012 at 12:10 AM, Aaron Daubman wrote:

    > Greetings,
    >
    > I've been digging in to this for two days now and have come up short -
    > hopefully there is some simple answer I am just not seeing:
    >
    > I have a solr 1.4.1 instance and a solr 3.6.0 instance, both configured as
    > identically as possible (given deprecations) and indexing the same document.

    Why did you do this? If you want the exact same scoring, use the exact
    same analysis. This means specifying luceneMatchVersion = 2.9, and the
    exact same analysis components (even if deprecated).

    > I have taken the field values for the example below and run them
    > through /admin/analysis.jsp on each solr instance. Even for the problematic
    > docs/fields, the results are almost identical. For the example below, the
    > t_tag values for the problematic doc:
    > 1.4.1: 162 values
    > 3.6.0: 164 values

    This is why: you changed your analysis.

    --
    lucidimagination.com
  • Aaron Daubman at Jul 19, 2012 at 3:12 pm
    Robert,

    >> I have a solr 1.4.1 instance and a solr 3.6.0 instance, both configured as
    >> identically as possible (given deprecations) and indexing the same
    >> document.
    >
    > Why did you do this? If you want the exact same scoring, use the exact
    > same analysis. This means specifying luceneMatchVersion = 2.9, and the
    > exact same analysis components (even if deprecated).
    >
    >> I have taken the field values for the example below and run them
    >> through /admin/analysis.jsp on each solr instance. Even for the
    >> problematic docs/fields, the results are almost identical. For the
    >> example below, the t_tag values for the problematic doc:
    >> 1.4.1: 162 values
    >> 3.6.0: 164 values
    >
    > This is why: you changed your analysis.
    Apologies if I didn't clearly state my goal/concern: I am not looking for
    the exact same scoring - I am looking to explain the scoring differences.
    Deprecated components will eventually go away, time moves on, etc. I would
    like to be able to run current code, and should be able to - the sticking
    point is being able to *explain* the difference in results.

    As you can see from my email, after running the input through the two
    different analysis chains, the output does not demonstrate (in any way
    that I can see) why the fieldNorm values would be so different. Even with
    the different analysis, the results are almost identical - which *should*
    result in an almost identical fieldNorm?

    Again, the desire is not to be the same, it is to understand the difference.

    Thanks,
    Aaron
  • Robert Muir at Jul 19, 2012 at 3:55 pm

    On Thu, Jul 19, 2012 at 11:11 AM, Aaron Daubman wrote:

    > Apologies if I didn't clearly state my goal/concern: I am not looking for
    > the exact same scoring - I am looking to explain scoring differences.
    > Deprecated components will eventually go away, time moves on, etc...
    > etc... I would like to be able to run current code, and should be able to -
    > the part that is sticking is being able to *explain* the difference in
    > results.
    OK: I totally missed that, sorry!

    To explain why you see such a large difference:

    These length normalizations are computed at index time and fit inside a
    *single byte* by default. This keeps RAM usage low for many documents and
    many fields with norms (since it's #fieldsWithNorms * #documents bytes in
    RAM).
    So this is lossy: basically you can think of there being only 256
    possible values. When you increased the number of terms only slightly by
    changing your analysis, that happened to bump you over the edge into the
    next quantized value.

    more information:
    http://lucene.apache.org/core/3_6_0/scoring.html
    http://lucene.apache.org/core/3_6_0/api/core/org/apache/lucene/search/Similarity.html

    By the way, if you don't like this:
    1. If you can still live with a single byte, maybe plug your own
    Similarity class into 3.6, overriding decodeNormValue/encodeNormValue.
    For example, you could use a different SmallFloat configuration that
    has less range but more precision for your use case (if your docs are
    all short, or whatever).
    2. Otherwise, if you feel you need more than a single byte, check out
    4.0-ALPHA: you aren't limited to a single byte there.
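    To make suggestion (1) concrete: below is a standalone sketch mirroring Lucene's generic SmallFloat.floatToByte/byteToFloat (re-implemented inline so it runs without the Lucene jar; treat org.apache.lucene.util.SmallFloat as the canonical code). With 5 mantissa bits and zeroExp=2 instead of the default 3/15, the bucket step around these norm values shrinks from 25% to about 6%:

```java
// Standalone copy of the generic SmallFloat byte encoding, parameterized
// by mantissa bits and zero exponent, comparing the precision of the
// default 3/15 ("315") config against the finer 5/2 config.
public class SmallFloatDemo {
    static byte floatToByte(float f, int numMantissaBits, int zeroExp) {
        int fzero = (63 - zeroExp) << numMantissaBits;
        int bits = Float.floatToRawIntBits(f);
        int smallfloat = bits >> (24 - numMantissaBits);
        if (smallfloat <= fzero) return (bits <= 0) ? (byte) 0 : (byte) 1;
        if (smallfloat >= fzero + 0x100) return -1;  // overflow: saturate
        return (byte) (smallfloat - fzero);
    }

    static float byteToFloat(byte b, int numMantissaBits, int zeroExp) {
        if (b == 0) return 0.0f;
        int bits = (b & 0xff) << (24 - numMantissaBits);
        bits += (63 - zeroExp) << 24;
        return Float.intBitsToFloat(bits);
    }

    public static void main(String[] args) {
        float n162 = (float) (1.0 / Math.sqrt(162));
        float n164 = (float) (1.0 / Math.sqrt(164));
        // default 3/15 config: adjacent buckets 25% apart
        System.out.println(byteToFloat(floatToByte(n162, 3, 15), 3, 15)); // 0.078125
        System.out.println(byteToFloat(floatToByte(n164, 3, 15), 3, 15)); // 0.0625
        // 5/2 config: adjacent buckets ~6% apart
        System.out.println(byteToFloat(floatToByte(n162, 5, 2), 5, 2));   // 0.078125
        System.out.println(byteToFloat(floatToByte(n164, 5, 2), 5, 2));   // 0.07421875
    }
}
```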

    --
    lucidimagination.com
  • Aaron Daubman at Jul 19, 2012 at 5:58 pm
    Robert,

    > So this is lossy: basically you can think of there being only 256
    > possible values. So when you increased the number of terms only
    > slightly by changing your analysis, this happened to bump you over the
    > edge rounding you up to the next value.
    >
    > more information:
    > http://lucene.apache.org/core/3_6_0/scoring.html
    > http://lucene.apache.org/core/3_6_0/api/core/org/apache/lucene/search/Similarity.html


    Thanks - this was extremely helpful! I had read both sources before but
    didn't grasp the magnitude of the lossiness until your pointer and mention
    of the edge case.
    Just to help out anybody else who might run into this, I hacked together a
    little harness to demonstrate:
    ---
    fieldLength: 160, computeNorm: 0.07905694, floatToByte315: 109, byte315ToFloat: 0.078125
    fieldLength: 161, computeNorm: 0.07881104, floatToByte315: 109, byte315ToFloat: 0.078125
    fieldLength: 162, computeNorm: 0.07856742, floatToByte315: 109, byte315ToFloat: 0.078125
    fieldLength: 163, computeNorm: 0.07832605, floatToByte315: 109, byte315ToFloat: 0.078125
    fieldLength: 164, computeNorm: 0.07808688, floatToByte315: 108, byte315ToFloat: 0.0625
    fieldLength: 165, computeNorm: 0.077849895, floatToByte315: 108, byte315ToFloat: 0.0625
    fieldLength: 166, computeNorm: 0.07761505, floatToByte315: 108, byte315ToFloat: 0.0625
    ---
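    For completeness, here is a standalone reconstruction of such a harness (it re-implements SmallFloat's 3.15 encode/decode inline rather than linking against the Lucene jar, so treat it as a sketch; org.apache.lucene.util.SmallFloat has the canonical code):

```java
// Computes DefaultSimilarity's raw lengthNorm (1/sqrt(numTerms)) for each
// field length, then pushes it through SmallFloat's 3.15 single-byte
// encoding to show exactly where the quantization boundary falls.
public class NormHarness {
    static byte floatToByte315(float f) {
        int bits = Float.floatToRawIntBits(f);
        int smallfloat = bits >> (24 - 3);
        if (smallfloat <= ((63 - 15) << 3)) return (bits <= 0) ? (byte) 0 : (byte) 1;
        if (smallfloat >= ((63 - 15) << 3) + 0x100) return -1;
        return (byte) (smallfloat - ((63 - 15) << 3));
    }

    static float byte315ToFloat(byte b) {
        if (b == 0) return 0.0f;
        int bits = (b & 0xff) << (24 - 3);
        bits += (63 - 15) << 24;
        return Float.intBitsToFloat(bits);
    }

    public static void main(String[] args) {
        for (int len = 160; len <= 166; len++) {
            float norm = (float) (1.0 / Math.sqrt(len));  // raw lengthNorm
            byte enc = floatToByte315(norm);
            System.out.printf("fieldLength: %d, computeNorm: %s, floatToByte315: %d, byte315ToFloat: %s%n",
                    len, norm, enc & 0xff, byte315ToFloat(enc));
        }
    }
}
```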

    So my takeaway is that these significantly varying scores are caused by:
    1) a field whose length falls right on this quantization boundary between
    the two analyzer chains
    2) the fact that we might be matching 50+ query values against a field
    with 150+ values, so the overall score is repeatedly impacted by the
    otherwise typically insignificant change in fieldNorm value

    Thanks again,
    Aaron

Discussion Overview
group: solr-user
categories: lucene
posted: Jul 19, '12 at 4:10a
active: Jul 19, '12 at 5:58p
posts: 5
users: 2
website: lucene.apache.org...

2 users in discussion: Aaron Daubman (3 posts), Robert Muir (2 posts)
