Hi,
Please advise me on how to create custom scoring. I need the documents
to be ordered by how popular each term in the document is
(popular = how many times it appears in the index) and by the length of the
document (fewer terms = higher in the search results).

For example, the index contains the following data:

ID | SEARCH_FIELD
------------------------------
1 | Russia
2 | Russia, Moscow
3 | Russia, Volgograd
4 | Russia, Ivanovo
5 | Russia, Ivanovo, Altayskaya street 45
6 | Russia, Moscow, Kremlin
7 | Russia, Moscow, Altayskaya street
8 | Russia, Moscow, Altayskaya street 15
9 | Russia, Moscow, Altayskaya street 15/26


And I should get the following results:


Query | Document result set
----------------------------------------------
Russia | 1,2,4,3,6,7,8,9,5
Moscow | 2,6,7,8,9
Ivanovo | 4,5
Altayskaya | 7,8,9,5

In fact, it is a search for geographic objects (cities, streets, houses).
Only part of the address may be given, and the most relevant results
should be returned.
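
One possible reading of the expected orderings above, as a minimal plain-Java sketch rather than Lucene code: sort the matching documents by token count (shorter first), then by the summed popularity of their terms (higher first), then by ID. The class name `AddressRanking`, the comma/whitespace tokenizer, counting popularity as document frequency, and the tie-break by ID are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.HashMap;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class AddressRanking {
    // Sample index from the post: document ID -> SEARCH_FIELD.
    static final Map<Integer, String> DOCS = new LinkedHashMap<>();
    static {
        DOCS.put(1, "Russia");
        DOCS.put(2, "Russia, Moscow");
        DOCS.put(3, "Russia, Volgograd");
        DOCS.put(4, "Russia, Ivanovo");
        DOCS.put(5, "Russia, Ivanovo, Altayskaya street 45");
        DOCS.put(6, "Russia, Moscow, Kremlin");
        DOCS.put(7, "Russia, Moscow, Altayskaya street");
        DOCS.put(8, "Russia, Moscow, Altayskaya street 15");
        DOCS.put(9, "Russia, Moscow, Altayskaya street 15/26");
    }

    // Assumption: split on commas and whitespace, so "15/26" stays one token.
    static List<String> tokens(String text) {
        return Arrays.asList(text.trim().split("[,\\s]+"));
    }

    // Assumption: popularity of a term = number of documents containing it.
    static Map<String, Integer> docFreq() {
        Map<String, Integer> df = new HashMap<>();
        for (String text : DOCS.values()) {
            for (String term : new HashSet<>(tokens(text))) {
                df.merge(term, 1, Integer::sum);
            }
        }
        return df;
    }

    // Summed popularity of all terms in one document.
    static int popularity(int id, Map<String, Integer> df) {
        int sum = 0;
        for (String term : tokens(DOCS.get(id))) {
            sum += df.get(term);
        }
        return sum;
    }

    // Exact-match, single-term search with the ordering described above.
    static List<Integer> search(String term) {
        Map<String, Integer> df = docFreq();
        List<Integer> hits = new ArrayList<>();
        for (Map.Entry<Integer, String> e : DOCS.entrySet()) {
            if (tokens(e.getValue()).contains(term)) {
                hits.add(e.getKey());
            }
        }
        hits.sort(Comparator
                .comparingInt((Integer id) -> tokens(DOCS.get(id)).size()) // shorter first
                .thenComparingInt(id -> -popularity(id, df))               // more popular first
                .thenComparingInt(id -> id));                              // tie-break by ID
        return hits;
    }

    public static void main(String[] args) {
        for (String q : new String[] {"Russia", "Moscow", "Ivanovo", "Altayskaya"}) {
            System.out.println(q + " -> " + search(q));
        }
    }
}
```

Under these assumptions the sketch reproduces the four result sets in the table; a different tokenizer (for example one that splits "15/26") would change the ordering of documents 8, 9 and 5.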

Thanks.
--
Pavel Minchenkov


  • Ian Lea at Dec 15, 2010 at 3:44 pm
    It sounds to me like Lucene should do a pretty good job without any extra
    work on your part. See the javadocs for
    org.apache.lucene.search.Similarity
    for details on how it works. You can change things by providing your
    own implementation.

    There is also the org.apache.lucene.search.function package but that
    is much more complex.


    A web search for "lucene scoring" should find you lots of info.


    --
    Ian.

    ---------------------------------------------------------------------
    To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
    For additional commands, e-mail: java-user-help@lucene.apache.org
  • Doron Cohen at Dec 15, 2010 at 4:10 pm
    Also, when taking Ian's Similarity suggestion, note two things in
    Lucene's default behavior that you seem to wish to avoid:

    The first is IDF. It matters only for multi-term queries; otherwise ignore
    this comment.
    For multi-term queries to consider only term frequency and doc length, you
    may want to always return 1 for idf() in your Similarity impl (otherwise
    terms appearing in more documents will contribute less to the score, which
    you seem to wish to avoid).
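
To make the IDF point concrete, here is a small plain-Java sketch (no Lucene dependency) comparing the formula documented for Lucene's DefaultSimilarity, 1 + ln(numDocs / (docFreq + 1)), with a constant idf of 1. The class name `IdfDemo` and the document frequencies, taken from the nine-document example in the question, are illustrative assumptions. In Lucene 3.x the corresponding hook would be an override of idf(int docFreq, int numDocs) in a Similarity subclass; that wiring is not shown and should be checked against the javadocs for your version.

```java
class IdfDemo {
    // Formula documented for Lucene's DefaultSimilarity:
    // idf = 1 + ln(numDocs / (docFreq + 1)).
    static float defaultIdf(int docFreq, int numDocs) {
        return (float) (Math.log(numDocs / (double) (docFreq + 1)) + 1.0);
    }

    // The suggestion above: every term gets the same weight,
    // no matter how many documents contain it.
    static float constantIdf(int docFreq, int numDocs) {
        return 1.0f;
    }

    public static void main(String[] args) {
        int numDocs = 9; // size of the example index
        String[] terms = {"Russia", "Moscow", "Ivanovo"};
        int[] docFreqs = {9, 5, 2}; // documents containing each term
        for (int i = 0; i < terms.length; i++) {
            System.out.printf("%-8s default=%.3f constant=%.1f%n",
                    terms[i],
                    defaultIdf(docFreqs[i], numDocs),
                    constantIdf(docFreqs[i], numDocs));
        }
    }
}
```

With the default formula a frequent term such as "Russia" gets a lower weight than "Ivanovo"; with the constant version all terms weigh the same, so only term frequency and length normalization separate the documents.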

    The second is the inaccuracy of doc length normalization: because doc
    lengths are encoded lossily, at search time Lucene might not distinguish
    between two documents whose lengths are almost the same. To address this,
    at indexing time your Similarity impl for lengthNorm() could return e.g.
    1/(10 * numTokens), reducing the chance that two docs of different lengths
    have the same search-time norm.
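
The effect of the lossy encoding can be sketched in plain Java. The encoder below is modeled on Lucene's SmallFloat.floatToByte315 (the one-byte norm format); its exact bit layout here is an assumption that should be checked against your Lucene version, and the class name `NormPrecision` is made up for this example. `defaultNorm` uses 1/sqrt(numTokens), the default length norm, and `scaledNorm` uses the 1/(10 * numTokens) variant suggested above.

```java
class NormPrecision {
    // One-byte float encoding, modeled on SmallFloat.floatToByte315.
    // Assumption: this mirrors the encoding used for norms in Lucene 3.x.
    static byte encode(float f) {
        int bits = Float.floatToRawIntBits(f);
        int small = bits >> (24 - 3);
        int zero = (63 - 15) << 3;
        if (small <= zero) {
            return (bits <= 0) ? (byte) 0 : (byte) 1; // underflow
        }
        if (small >= zero + 0x100) {
            return (byte) -1;                         // overflow
        }
        return (byte) (small - zero);
    }

    // Default-style length norm: 1 / sqrt(numTokens).
    static float defaultNorm(int numTokens) {
        return (float) (1.0 / Math.sqrt(numTokens));
    }

    // Variant suggested above: 1 / (10 * numTokens).
    static float scaledNorm(int numTokens) {
        return 1.0f / (10 * numTokens);
    }

    public static void main(String[] args) {
        for (int n = 1; n <= 10; n++) {
            System.out.printf("tokens=%2d default byte=%4d scaled byte=%4d%n",
                    n, encode(defaultNorm(n)), encode(scaledNorm(n)));
        }
    }
}
```

In this model, 1/sqrt(numTokens) maps lengths 3 and 4 to the same byte, while 1/(10 * numTokens) keeps them apart. Collisions still occur for longer fields, so this reduces the problem rather than removing it.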

    Doron
  • Grant Ingersoll at Dec 15, 2010 at 9:24 pm
    Have a look at http://lucene.apache.org/java/3_0_2/scoring.html on how Lucene's scoring works. You can override the Similarity class in Solr as well via the schema.xml file.
    --------------------------
    Grant Ingersoll
    http://www.lucidimagination.com


  • Alexey Serba at Dec 19, 2010 at 9:35 pm
    Hi Pavel,

    I had a similar problem several years ago: I had to find
    geographical locations in textual descriptions, geocode these objects
    to lat/long during the indexing process, and allow users to filter/sort
    search results by specific geographical areas. The important issue was
    that there were several types of geographical objects: street < town
    < region < country. The idea was to geocode to the narrowest
    geographical area possible. Relevance logic in this case could be
    specified as "find the narrowest result that is uniquely identified by
    your text or search query". So I came up with a custom algorithm that
    was quite good in terms of performance and precision/recall. Here's
    a simple description:
    * Intersect all text/search query terms with a locations
    dictionary to keep only the geo terms.
    * Search your locations Lucene index and filter to street objects only
    (the narrowest areas). Thanks to the tf*idf formula you'll get the most
    relevant results. Then post-process the top N (3/5/10) results and
    verify that they are indeed matches. I intersected the search terms with
    each result's terms and ran another Lucene search to verify that these
    terms uniquely identify the match. If they do, return the matching street.
    If there is no match, proceed with the same algorithm for towns,
    regions, and countries.
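
The steps above can be sketched roughly as follows. This is an in-memory illustration, not Alexey's actual code: the names `GeoCascade`, `Location` and `geocode`, the sample data, ranking by number of shared terms as a stand-in for tf*idf, and restricting the uniqueness check to locations of the same level are all assumptions.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Optional;
import java.util.Set;

class GeoCascade {
    // Ordered from narrowest to broadest; values() preserves this order.
    enum Level { STREET, TOWN, REGION, COUNTRY }

    static final int TOP_N = 3; // how many ranked candidates to verify per level

    static final class Location {
        final String name;
        final Level level;
        final Set<String> terms;

        Location(String name, Level level) {
            this.name = name;
            this.level = level;
            this.terms = tokenize(name);
        }
    }

    static Set<String> tokenize(String text) {
        return new HashSet<>(Arrays.asList(
                text.toLowerCase(Locale.ROOT).trim().split("[,\\s]+")));
    }

    static Set<String> shared(Location loc, Set<String> geoTerms) {
        Set<String> result = new HashSet<>(loc.terms);
        result.retainAll(geoTerms);
        return result;
    }

    static Optional<Location> geocode(String text, List<Location> index) {
        // Step 1: keep only the terms that occur in the locations dictionary.
        Set<String> dictionary = new HashSet<>();
        for (Location loc : index) {
            dictionary.addAll(loc.terms);
        }
        Set<String> geoTerms = tokenize(text);
        geoTerms.retainAll(dictionary);

        // Step 2: try the narrowest level first, then widen.
        for (Level level : Level.values()) {
            List<Location> candidates = new ArrayList<>();
            for (Location loc : index) {
                if (loc.level == level && !shared(loc, geoTerms).isEmpty()) {
                    candidates.add(loc);
                }
            }
            // Stand-in for tf*idf ranking: more shared terms first.
            candidates.sort((a, b) ->
                    shared(b, geoTerms).size() - shared(a, geoTerms).size());
            int limit = Math.min(TOP_N, candidates.size());
            for (Location candidate : candidates.subList(0, limit)) {
                // Verification: do the shared terms identify exactly one location?
                Set<String> common = shared(candidate, geoTerms);
                long matches = index.stream()
                        .filter(l -> l.level == level && l.terms.containsAll(common))
                        .count();
                if (matches == 1) {
                    return Optional.of(candidate);
                }
            }
        }
        return Optional.empty();
    }

    // Made-up sample data for the usage example.
    static List<Location> sampleIndex() {
        return Arrays.asList(
                new Location("Russia", Level.COUNTRY),
                new Location("Russia, Moscow", Level.TOWN),
                new Location("Russia, Ivanovo", Level.TOWN),
                new Location("Russia, Moscow, Altayskaya street", Level.STREET),
                new Location("Russia, Moscow, Tverskaya street", Level.STREET),
                new Location("Russia, Ivanovo, Altayskaya street", Level.STREET));
    }

    public static void main(String[] args) {
        List<Location> index = sampleIndex();
        for (String q : new String[] {"Altayskaya street in Moscow", "somewhere in Moscow"}) {
            System.out.println(q + " -> "
                    + geocode(q, index).map(l -> l.name).orElse("(no unique match)"));
        }
    }
}
```

With this sample data, a query naming both the street and the town resolves to the street, a query naming only the town falls through to the town level, and a bare "Altayskaya street" stays ambiguous because two towns have a street with that name.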

    HTH,
    Alexey

Discussion Overview
group: java-user
categories: lucene
posted: Dec 15, '10 at 3:29p
active: Dec 19, '10 at 9:35p
posts: 5
users: 5
website: lucene.apache.org
