I am using Lucene 2.9.3 (via Solr 1.4.1) on Windows and am trying to
understand ShingleFilter. With the configuration below, I find that if
I provide more words than the phrase actually indexed in the field, then
the search on that field fails (no score found with debugQuery=true).

Here is an example to reproduce, with field names:
Id: 1
title_1: Nina Simone
title_2: I put a spell on you
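
For anyone who wants to reproduce this, the example document could be posted to Solr's XML update handler roughly as follows. This is only a sketch: the lowercase field name `id` is assumed from the qf setting mentioned further down, and the catchall field is left out because the post does not say how it is populated.

```xml
<add>
  <doc>
    <!-- Field name "id" is an assumption based on the qf setting -->
    <field name="id">1</field>
    <field name="title_1">Nina Simone</field>
    <field name="title_2">I put a spell on you</field>
  </doc>
</add>
```

A `<commit/>` message would normally follow before the document becomes searchable.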

Query (dismax) with:
- “Nina Simone I put” <- FAILS, i.e. no score shown from the title_1 search (using debugQuery)
- “Nina Simone” <- SUCCESS

But when I used Solr’s Field Analysis page with the ‘shingle’ field type
(given below) and tried “Nina Simone I put”, it succeeds. Only at query
time is no score provided. I also checked ‘parsedquery’, and
it shows a DisjunctionMaxQuery issuing the string “Nina_Simone Simone_I
I_put” to the title_1 field.

title_1 and title_2 fields are of type ‘shingle’, defined as:

<fieldType name="shingle" class="solr.TextField"
           positionIncrementGap="100" indexed="true" stored="true">
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.ShingleFilterFactory"
            maxShingleSize="2" outputUnigrams="false"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.ShingleFilterFactory"
            maxShingleSize="2" outputUnigrams="false"/>
  </analyzer>
</fieldType>

Note that I also have a catchall field which is text. I have qf set
to: 'id^2 catchall' and pf set to: 'title_1^1.5 title_2^1.2'
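
For reference, the qf and pf settings above would typically live in a request handler in solrconfig.xml. The fragment below is a minimal sketch of that setup; the handler name `dismax` is a placeholder, and the actual handler in use may carry additional defaults that are not shown in this post.

```xml
<!-- Hypothetical handler name; only qf and pf are taken from the post -->
<requestHandler name="dismax" class="solr.SearchHandler">
  <lst name="defaults">
    <str name="defType">dismax</str>
    <!-- Fields searched term by term -->
    <str name="qf">id^2 catchall</str>
    <!-- Fields searched with the whole query as a phrase, for boosting -->
    <str name="pf">title_1^1.5 title_2^1.2</str>
  </lst>
</requestHandler>
```

The point to notice is that title_1 and title_2 appear only in pf, so they are queried with the entire input as one phrase rather than word by word.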

If I am missing something or doing something wrong please let me know.

-Ethan

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


  • Steven A Rowe at Jul 13, 2010 at 3:47 pm
    Hi Ethan,

    You'll probably get better answers about Solr-specific stuff on the solr-user@a.l.o list.

    Check out PositionFilterFactory - it may address your issue:

    http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.PositionFilterFactory

    Steve
    -----Original Message-----
    From: Ethan Collins
    Sent: Tuesday, July 13, 2010 3:42 AM
    To: java-user@lucene.apache.org
    Subject: ShingleFilter failing with more terms than index phrase

  • Ethan Collins at Jul 14, 2010 at 7:16 am
    Hi Steve,

    Thanks for your kind response. I tried PositionFilterFactory
    (and re-indexed as well), but that didn't solve the problem either.
    Interestingly, the problem is not reproducible from Solr's Field
    Analysis page; it manifests only in an actual query.

    I guess the subject of this post is not quite accurate. It's not that
    ShingleFilter is failing; rather, when using ShingleFilter, the shingle
    field provides no score when I pass more terms than were indexed.
    I observe this using debugQuery.

    I had actually posted to solr-user but have received no response yet,
    probably because the problem is not clear at first glance. However,
    the mail includes an example that anyone interested can try out to
    check whether there is a problem. Let's see if I receive any
    response.

    -Ethan
  • Ethan Collins at Jul 14, 2010 at 9:28 am
    Hi Steve,

    Thanks, wrapping with PositionFilter actually fixed both the search and
    the score. I had made a mistake while re-indexing last time.

    Trying to analyze PositionFilter: I didn't understand why the earlier
    search for 'Nina Simone I put' failed, since at least the phrase 'Nina
    Simone' should have matched against the title_1 field. Any clue?
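
    A possible explanation, offered as a sketch rather than a confirmed diagnosis: dismax sends the whole input to each pf field as a single phrase, and with bigram-only shingles that phrase is "nina_simone simone_i i_put". A phrase query needs every one of its terms at consecutive positions, but title_1 holds only the shingle nina_simone, so the phrase cannot match and title_1 contributes no score. With PositionFilter at the end of the query analyzer, all query shingles share one position, and the query parser then appears to treat them as alternatives instead of a phrase, so a single matching shingle is enough. The fragment below assumes the filter was added to the query analyzer only, as on the wiki page linked earlier in the thread; the post does not show the exact change.

    ```xml
    <fieldType name="shingle" class="solr.TextField"
               positionIncrementGap="100" indexed="true" stored="true">
      <analyzer type="index">
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.ShingleFilterFactory"
                maxShingleSize="2" outputUnigrams="false"/>
      </analyzer>
      <analyzer type="query">
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.ShingleFilterFactory"
                maxShingleSize="2" outputUnigrams="false"/>
        <!-- Assumed placement: last in the query chain, so every shingle
             after the first gets a position increment of 0 -->
        <filter class="solr.PositionFilterFactory"/>
      </analyzer>
    </fieldType>
    ```

    The trade-off of this setup is that word order across shingles no longer matters at query time: any shared bigram can match, which loosens phrase matching on these fields.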

    I am also trying to understand the impact of PositionFilter on phrase
    search quality and score. Unfortunately, there is not much
    literature or help to be found through Google.

    -Ethan

  • Ethan Collins at Jul 14, 2010 at 10:00 am

    Trying to analyze PositionFilter: I didn't understand why the earlier
    search for 'Nina Simone I put' failed, since at least the phrase 'Nina
    Simone' should have matched against the title_1 field. Any clue?

    Please note that I have configured the ShingleFilter to emit bigrams without unigrams.

    [Honestly, I am still struggling to understand how this worked and the
    earlier one didn't]

    -Ethan

