Hello Lucene Experts,

I wonder if someone might be able to shed some light on this interesting scoring question:

The problem:
Build a search query that returns hits ordered by the number of occurrences of each field value across the matched documents (or as close to this as possible).
The built-in scoring is great for scoring the number of hits within a single document, but is there an efficient way to count values of the same field across a whole set of matched documents? (Maybe scoring isn't the best way?)

Example:
Let's say you have an index containing book information. Each document has a 'title' field.
Let's say the index contains 100 entries, with:
65 'title's containing the word 'tiger'
21 containing 'lion'
6 containing 'panther'
5 containing 'kitten'
3 containing 'slug'

What would be the best way to build a query such that returned documents are ordered in this way:
Rank  Value    Occurrences
==========================
1     tiger    65
2     lion     21
3     panther  6
4     kitten   5
5     slug     3

I can, of course, build a standard query, traverse the returned documents and build such a list myself, but if the query returned hundreds of thousands of hits, the cost would grow linearly with the number of hits, which is wasteful when only the 'Top 5' are actually required.
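
For reference, the traversal described above can be sketched in plain Java. This is only an illustration under stated assumptions, not Lucene code: it assumes the 'title' value of every matched document has already been read into a list, that each title holds a single word as in the example, and the names `TopValueCounts` and `topValues` are made up for this sketch. It tallies every value once and then keeps the top n in a small heap, so the work is still linear in the number of hits.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.PriorityQueue;

// Hypothetical helper, not part of Lucene: counts field values over the hits.
class TopValueCounts {

    /** Tallies each value in matchedValues and returns the n most frequent, highest count first. */
    static List<Map.Entry<String, Integer>> topValues(List<String> matchedValues, int n) {
        // One pass over all hits: this is the linear cost mentioned above.
        Map<String, Integer> counts = new HashMap<>();
        for (String value : matchedValues) {
            counts.merge(value, 1, Integer::sum);
        }

        // Min-heap bounded to n entries, so the smallest count is evicted first.
        PriorityQueue<Map.Entry<String, Integer>> heap = new PriorityQueue<>(
                Comparator.comparingInt((Map.Entry<String, Integer> e) -> e.getValue()));
        for (Map.Entry<String, Integer> entry : counts.entrySet()) {
            heap.offer(entry);
            if (heap.size() > n) {
                heap.poll();
            }
        }

        // Sort the survivors by descending count for display.
        List<Map.Entry<String, Integer>> top = new ArrayList<>(heap);
        top.sort(Map.Entry.<String, Integer>comparingByValue().reversed());
        return top;
    }

    public static void main(String[] args) {
        // The 100 example titles from the question, one word each (an assumption).
        List<String> titles = new ArrayList<>();
        for (int i = 0; i < 65; i++) titles.add("tiger");
        for (int i = 0; i < 21; i++) titles.add("lion");
        for (int i = 0; i < 6; i++) titles.add("panther");
        for (int i = 0; i < 5; i++) titles.add("kitten");
        for (int i = 0; i < 3; i++) titles.add("slug");

        List<Map.Entry<String, Integer>> top = topValues(titles, 5);
        for (int i = 0; i < top.size(); i++) {
            System.out.println((i + 1) + " " + top.get(i).getKey() + " " + top.get(i).getValue());
        }
    }
}
```

Run on the example data, this prints the same ranking as the table above. The heap only bounds the final selection step; every matched document is still visited once, which is exactly the cost I would like to avoid.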


One idea is to maintain a separate index with this information - the main problem with this is that you essentially need to know what you're searching for at index-time, which isn't ideal.


Has anyone come across and solved this particular issue using Lucene?

Many thanks,
Peter





  • Jake Mannix at Nov 22, 2009 at 5:20 pm
    Peter,

    You want to do a facet query. This kind of functionality is not in
    Lucene-core (sadly), but both Solr (the fully featured search application
    built on Lucene) and bobo-browse (just a library, like Lucene itself) are
    open-source and work with Lucene to provide faceting capabilities for you.

    -jake
  • Peter 4U at Nov 22, 2009 at 5:45 pm
    Hi Jake,

    Many thanks for your quick reply.
    I shall check these out.

    Thanks!
    Peter


    Date: Sun, 22 Nov 2009 09:20:24 -0800
    Subject: Re: Top field count scoring across documents

Discussion Overview
group: java-user
category: lucene
posted: Nov 22, '09 at 4:43p
active: Nov 22, '09 at 5:45p
posts: 3
users: 2
website: lucene.apache.org

2 users in discussion

Peter 4U: 2 posts
Jake Mannix: 1 post
