Hi, I'm building a system where I want to show only results indexed in the
past few days. Furthermore, I don't want to maintain a giant index with
millions of documents if I only want to return results from a couple of days
(thousands of documents).

My system relies heavily on the terms in the indexed documents having a
realistic distribution (and, consequently, a realistic IDF).

That said, I would like to use a small index to return results, but I want
to compute document scores using an IDF from a much larger index (or even an
external source).

The Similarity API doesn't seem to allow me to do this. The *idf* method
does not receive the term being scored as a parameter.

Another possibility is to use TrieRangeQuery to make sure the documents
shown are from the last couple of days. Again, I'd rather not maintain a
large index. Also, this kind of query is not cheap.

Am I missing something?

Thanks


Felipe Hummel


  • Andrzej Bialecki at Mar 10, 2011 at 8:37 pm

    On 3/10/11 8:32 PM, Felipe Hummel wrote:
    Am I missing something?
    Take a look at SOLR-1632. Indeed, it's not possible to do this using
    Similarity alone. You will need something like the DFSource class in
    that patch, i.e. a subclass of IndexSearcher, where you populate
    term->DF map with values obtained from the full index, and then you use
    this map to calculate IDF.
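
The term->DF idea described above can be sketched without any Lucene dependency. This is only an illustration under stated assumptions: the class `ExternalDfSource` and its methods are hypothetical names (they are not the DFSource class from SOLR-1632), and the formula 1 + ln(numDocs / (docFreq + 1)) is assumed to match the classic default similarity. The map would be filled once from the full index and then consulted while scoring against the small one.

```java
import java.util.HashMap;
import java.util.Map;

/**
 * Hypothetical holder for document frequencies taken from a large
 * reference index. Not part of Lucene or of the SOLR-1632 patch.
 */
final class ExternalDfSource {
    private final Map<String, Integer> docFreqs = new HashMap<>();
    private final int numDocs; // document count of the large index

    ExternalDfSource(int numDocs) {
        if (numDocs <= 0) {
            throw new IllegalArgumentException("numDocs must be positive");
        }
        this.numDocs = numDocs;
    }

    /** Records how many documents of the large index contain the term. */
    void put(String term, int docFreq) {
        docFreqs.put(term, docFreq);
    }

    /** Terms never seen in the large index are treated as docFreq 0. */
    int docFreq(String term) {
        return docFreqs.getOrDefault(term, 0);
    }

    /** Assumed formula: 1 + ln(numDocs / (docFreq + 1)). */
    double idf(String term) {
        return 1.0 + Math.log(numDocs / (double) (docFreq(term) + 1));
    }

    public static void main(String[] args) {
        ExternalDfSource source = new ExternalDfSource(1_000_000);
        source.put("lucene", 999);   // rare term in the large index
        source.put("the", 900_000);  // very common term
        System.out.println("idf(lucene) = " + source.idf("lucene"));
        System.out.println("idf(the)    = " + source.idf("the"));
    }
}
```

Because the IDF depends only on the map and on the size of the large index, the small index can stay at a few thousand documents without distorting term weights.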

    --
    Best regards,
    Andrzej Bialecki <><
    ___. ___ ___ ___ _ _ __________________________________
    [__ || __|__/|__||\/| Information Retrieval, Semantic Web
    ___|||__|| \| || | Embedded Unix, System Integration
    http://www.sigram.com Contact: info at sigram dot com


  • Felipe Hummel at Mar 14, 2011 at 2:02 am
    On Stack Overflow, somebody gave me this answer:

    You should be able to extend IndexReader and override the docFreq() methods
    to provide whatever values you'd like. One thing this implementation can do
    is open two IndexReader instances -- one for the small index and one for the
    large index. All the methods are delegated to the small IndexReader, except
    for docFreq(), which is delegated to the large index. You'll need to scale
    the value returned, i.e.
    // cast to double first: int / int would truncate the ratio to 0
    int myNewDocFreq = (int) (((double) bigIndexReader.docFreq(t)
            / bigIndexReader.maxDoc()) * smallIndexReader.maxDoc());


    It seems all right to me. Is that correct?
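
One detail worth checking in the scaling step: in Java, dividing one int by another truncates, so the ratio docFreq / maxDoc has to be computed in floating point before multiplying by the small index's maxDoc. The sketch below shows both variants side by side. The class and method names are hypothetical, and plain ints stand in for the values that `docFreq(t)` and `maxDoc()` would return.

```java
/** Hypothetical helper; plain ints stand in for IndexReader values. */
final class DocFreqScaling {

    /** Scales a docFreq from the large index down to the small index's size. */
    static int scaledDocFreq(int bigDocFreq, int bigMaxDoc, int smallMaxDoc) {
        if (bigMaxDoc <= 0) {
            throw new IllegalArgumentException("bigMaxDoc must be positive");
        }
        // Compute the fraction in floating point to avoid truncation.
        double fraction = (double) bigDocFreq / bigMaxDoc;
        return (int) Math.round(fraction * smallMaxDoc);
    }

    /** Same expression with int division, kept only for comparison. */
    static int scaledDocFreqIntDivision(int bigDocFreq, int bigMaxDoc, int smallMaxDoc) {
        // bigDocFreq / bigMaxDoc is 0 whenever bigDocFreq < bigMaxDoc.
        return (bigDocFreq / bigMaxDoc) * smallMaxDoc;
    }

    public static void main(String[] args) {
        // A term found in 5% of a 1,000,000-document index,
        // scaled to a 10,000-document index.
        System.out.println(scaledDocFreq(50_000, 1_000_000, 10_000));
        System.out.println(scaledDocFreqIntDivision(50_000, 1_000_000, 10_000));
    }
}
```

Terms that are very rare in the large index round down to 0 after scaling. Whether that is acceptable depends on the IDF formula in use, so it may be worth checking how the similarity handles a docFreq of 0.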

Discussion Overview
group: java-user @
categories: lucene
posted: Mar 10, '11 at 7:33p
active: Mar 14, '11 at 2:02a
posts: 3
users: 2
website: lucene.apache.org

2 users in discussion

Felipe Hummel: 2 posts Andrzej Bialecki: 1 post
