Hi,

I would like to change the IDF value of the Lucene similarity
computation to an "inverse document frequency inside category". Only
the documents that have a certain category should be considered, not
the complete collection. The categories are stored in a separate field.

The implementation below works, but it is kind of slow. I was
wondering if there is a more efficient way than to read the DocIdSet
from the index for each term.

Thanks in advance for any pointers you might have!
Regards,
Max

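import java.io.IOException;

import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.DefaultSimilarity;
import org.apache.lucene.search.Explanation;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Searcher;
import org.apache.lucene.search.TermsFilter;
import org.apache.lucene.util.OpenBitSet;

public class InCategorySimilarity extends DefaultSimilarity {

    public InCategorySimilarity() {}

    // These objects have to be here so that they are visible across
    // multiple executions of idfExplain
    OpenBitSet categoryIdSet;
    long catDocs = 1;

    @Override
    public Explanation.IDFExplanation idfExplain(final Term term,
            final Searcher searcher) throws IOException {
        return new Explanation.IDFExplanation() {
            long termCategoryFreq = 0;
            boolean isCategoryField = term.field().equals("CATEGORY");

            // Counts the documents that contain the term; for non-category
            // terms the count is restricted to the current category
            private long termCategoryFreq() {
                try {
                    IndexReader reader = ((IndexSearcher) searcher).getIndexReader();
                    TermsFilter filter = new TermsFilter();
                    filter.addTerm(term);
                    OpenBitSet docSet = (OpenBitSet) filter.getDocIdSet(reader);

                    if (isCategoryField) {
                        categoryIdSet = docSet;
                        catDocs = categoryIdSet.cardinality();
                    } else if (categoryIdSet != null) {
                        // Relies on the CATEGORY term having been seen first
                        docSet.and(categoryIdSet);
                    }
                    termCategoryFreq = docSet.cardinality();
                } catch (IOException e) {
                    // getIdf() cannot throw a checked exception, so rethrow
                    throw new RuntimeException(e);
                }
                return termCategoryFreq;
            }

            public float invCatFreq(long termCategoryFreq, long catDocs) {
                return termCategoryFreq == 0 ? 0
                        : (float) (Math.log((double) catDocs / termCategoryFreq) + 1.0);
            }

            @Override
            public float getIdf() {
                termCategoryFreq = termCategoryFreq();
                return invCatFreq(termCategoryFreq, catDocs);
            }

            // IDFExplanation also declares explain() as abstract
            @Override
            public String explain() {
                return "inCategoryIdf(termCategoryFreq=" + termCategoryFreq
                        + ", catDocs=" + catDocs + ")";
            }
        };
    }
}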
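
One possible way to avoid re-reading the category's document set for every term is to cache it per category value. The thread itself does not propose this; what follows is only a rough sketch in plain Java, with java.util.BitSet standing in for Lucene's OpenBitSet. The names CategoryDocSetCache, docsIn, inCategoryDocFreq, loadCount and the loader function are all hypothetical.

```java
import java.util.BitSet;
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

/** Hypothetical helper: keeps one document-id bit set per category value. */
class CategoryDocSetCache {
    private final Map<String, BitSet> cache = new HashMap<>();
    // Assumption: the loader reads the category's doc set from the index
    private final Function<String, BitSet> loader;
    private int loads = 0;

    CategoryDocSetCache(Function<String, BitSet> loader) {
        this.loader = loader;
    }

    /** Returns the doc set for a category, calling the loader only on first use. */
    BitSet docsIn(String category) {
        return cache.computeIfAbsent(category, c -> {
            loads++;
            return loader.apply(c);
        });
    }

    /** Counts documents that are in the category and also contain the term. */
    int inCategoryDocFreq(String category, BitSet termDocs) {
        BitSet copy = (BitSet) termDocs.clone(); // leave the caller's set untouched
        copy.and(docsIn(category));
        return copy.cardinality();
    }

    /** Number of times the loader was actually invoked. */
    int loadCount() {
        return loads;
    }

    public static void main(String[] args) {
        CategoryDocSetCache cache = new CategoryDocSetCache(c -> {
            BitSet docs = new BitSet();
            docs.set(0); docs.set(2); docs.set(5); // made-up doc ids for the category
            return docs;
        });
        BitSet termDocs = new BitSet();
        termDocs.set(2); termDocs.set(3); termDocs.set(5); // made-up doc ids for a term
        System.out.println(cache.inCategoryDocFreq("sport", termDocs));
        System.out.println(cache.inCategoryDocFreq("sport", termDocs));
        System.out.println("loader calls: " + cache.loadCount());
    }
}
```

A cache like this would have to be discarded whenever the index reader is reopened, since document ids can change between readers.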

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]


  • Mark harwood at Oct 18, 2010 at 12:32 pm
    Can you not just call reader.docFreq(categoryTerm)?

    The returned figure includes deleted docs, but the search term uses this
    method too, so it should suffer from the same inaccuracy.

    Cheers
    Mark



    ----- Original Message ----
    From: Max Jakob <[email protected]>
    To: [email protected]
    Sent: Mon, 18 October, 2010 12:26:33
    Subject: Consider only documents of a category for IDF

  • Max Jakob at Oct 18, 2010 at 2:33 pm
    Thanks Mark, the call reader.docFreq(categoryTerm) is certainly a good
    way to get the numerator of the IDF formula
    (http://en.wikipedia.org/wiki/Tf%E2%80%93idf#Mathematical_details).

    However, what is still missing is the denominator. For this I want the
    number of in-category documents that each term appears in (again,
    categories are in a separate field). Calling reader.docFreq(term) for
    this would give me the document frequency in the complete collection,
    but I only want the number of documents within the category that the
    term appears in.

    So for a query +CATEGORY:sport TEXT:Johnson, I would like to set the IDF to
    log( (number of all sports documents)
    / (number of all sports documents that contain Johnson) )

    Is there an efficient way of doing this?

    Cheers,
    Max
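
To make the formula in the post above concrete, here is a small sketch that computes it from two counts. The class name InCategoryIdfExample, the method inCategoryIdf and the numbers in main are made up for illustration; note that the implementation in the first post additionally adds 1.0 to the logarithm, which this sketch leaves out to match the formula as written here.

```java
/** Hypothetical worked example of the in-category IDF formula. */
class InCategoryIdfExample {

    /**
     * log(catDocs / termCatDocs), where catDocs is the number of documents
     * in the category and termCatDocs is the number of those documents that
     * contain the term. Returns 0 when the term never occurs in the category.
     */
    static float inCategoryIdf(long catDocs, long termCatDocs) {
        if (termCatDocs == 0) {
            return 0f;
        }
        return (float) Math.log((double) catDocs / termCatDocs);
    }

    public static void main(String[] args) {
        long sportsDocs = 8;            // assumed: documents with CATEGORY:sport
        long sportsDocsWithJohnson = 2; // assumed: of those, documents with TEXT:Johnson
        System.out.println(inCategoryIdf(sportsDocs, sportsDocsWithJohnson));
    }
}
```

With these assumed counts the value is the natural logarithm of 4; a term that appears in every sports document would get an IDF of 0.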

Discussion Overview
group: java-user @
categories: lucene
posted: Oct 18, '10 at 11:27a
active: Oct 18, '10 at 2:33p
posts: 3
users: 2
website: lucene.apache.org