Hi,

I am trying to compute the counts of terms of the documents returned
by running a query using a TermVectorMapper.
I was wondering if anyone knew if there was a faster way to do this
rather than using a HashMap with a TermVectorMapper to store the
counts of the terms and calling getTermFreqVector().
I do not require the term frequency within a document.

Thanks,
Thomas

// Maps Term(field, text) -> number of matching documents that contain it.
HashMap termDocCount = new HashMap();
TermQuery tagQuery = new TermQuery(tagTerm);
TopDocs docs = searcher.search(tagQuery, numDocs);
for (int i = 0; i < docs.scoreDocs.length; ++i) {
    ScoreDoc sdoc = docs.scoreDocs[i];
    Document doc = ir.document(sdoc.doc);
    // iterate over a subset of index fields
    for (int j = 0; j < fieldNames.length; ++j) {
        String fieldName = fieldNames[j];
        DocTermVectorMapper vMapper = new DocTermVectorMapper(termDocCount);
        ir.getTermFreqVector(sdoc.doc, fieldName, vMapper);
    }
}

private class DocTermVectorMapper extends TermVectorMapper {

    private HashMap termDocCount;
    private String currField;

    DocTermVectorMapper(HashMap termDocCount) {
        this.termDocCount = termDocCount;
    }

    // Offsets and positions are not needed for counting.
    public boolean isIgnoringOffsets() {
        return true;
    }

    public boolean isIgnoringPositions() {
        return true;
    }

    // Called once per term of the current field's term vector.
    public void map(String term, int frequency, TermVectorOffsetInfo[] offsets, int[] positions) {
        Term t = new Term(currField, term);
        if (!termDocCount.containsKey(t))
            termDocCount.put(t, new Int());   // first sighting starts at 1
        else {
            ((Int) termDocCount.get(t)).x++;
        }
    }

    // Called before map() with the field whose vector is being read.
    public void setExpectations(String field, int numTerms, boolean storeOffsets, boolean storePositions) {
        currField = field;
    }
}

// Mutable int holder, so a count can be incremented in place.
private class Int {
    int x;
    Int() {
        x = 1;
    }
}
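
One small thing that could be tried in the loop above, offered only as a sketch: count with a single map lookup per term instead of `containsKey` followed by `put` or `get`. The example uses plain Java collections in place of Lucene term vectors; `DocFrequencyCounter`, `addDocument`, `docFreq`, and the sample terms are hypothetical names for illustration and are not part of the Lucene API. Whether this makes a measurable difference next to the cost of reading term vectors is not something the thread establishes.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

final class DocFrequencyCounter {
    // Mutable counter so an existing entry can be incremented in place.
    static final class Counter {
        int value;
    }

    private final Map<String, Counter> counts = new HashMap<String, Counter>();

    // Call once per document with that document's distinct terms
    // (assumption: each term is listed once per document).
    void addDocument(Iterable<String> distinctTerms) {
        for (String term : distinctTerms) {
            Counter c = counts.get(term);   // one lookup when the term is known
            if (c == null) {
                c = new Counter();
                counts.put(term, c);
            }
            c.value++;
        }
    }

    // Number of added documents that contained the term.
    int docFreq(String term) {
        Counter c = counts.get(term);
        return c == null ? 0 : c.value;
    }

    public static void main(String[] args) {
        DocFrequencyCounter counter = new DocFrequencyCounter();
        counter.addDocument(Arrays.asList("lucene", "index", "search"));
        counter.addDocument(Arrays.asList("lucene", "query"));
        System.out.println("lucene: " + counter.docFreq("lucene"));
        System.out.println("query: " + counter.docFreq("query"));
    }
}
```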


  • Grant Ingersoll at Oct 14, 2009 at 1:15 pm

    On Oct 12, 2009, at 10:46 PM, Thomas D'Silva wrote:

    > Hi,
    >
    > I am trying to compute the counts of terms of the documents returned
    > by running a query using a TermVectorMapper.
    > I was wondering if anyone knew if there was a faster way to do this
    > rather than using a HashMap with a TermVectorMapper to store the
    > counts of the terms and calling getTermFreqVector().
    > I do not require the term frequency within a document.

    I think that is as fast as it's going to get unless you have some other
    restrictions that would allow you to use a FieldCache. Can you
    describe the bigger problem you are trying to solve?

    --------------------------
    Grant Ingersoll
    http://www.lucidimagination.com/

    Search the Lucene ecosystem (Lucene/Solr/Nutch/Mahout/Tika/Droids)
    using Solr/Lucene:
    http://www.lucidimagination.com/search


    ---------------------------------------------------------------------
    To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
    For additional commands, e-mail: java-user-help@lucene.apache.org
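
To make the FieldCache suggestion above more concrete, here is a sketch under stated assumptions. The idea is an array indexed by document id that holds the single value of a field for each document, which only fits fields with at most one token per document; that is presumably the kind of restriction the reply has in mind. The plain-Java example below stands in for such an array: `ValueByDocCounter`, `countValues`, and the sample data are hypothetical, and nothing here calls Lucene.

```java
import java.util.HashMap;
import java.util.Map;

final class ValueByDocCounter {
    // valueByDoc[docId] is the single field value of that document, or null.
    // hits holds the ids of the documents matched by the query.
    static Map<String, Integer> countValues(String[] valueByDoc, int[] hits) {
        Map<String, Integer> counts = new HashMap<String, Integer>();
        for (int docId : hits) {
            String value = valueByDoc[docId];   // array read, no per-document I/O
            if (value == null) {
                continue;                       // document has no value for the field
            }
            Integer old = counts.get(value);
            counts.put(value, old == null ? 1 : old + 1);
        }
        return counts;
    }

    public static void main(String[] args) {
        String[] categoryByDoc = {"java", "search", "java", null, "search", "java"};
        int[] hits = {0, 2, 3, 4};
        Map<String, Integer> counts = countValues(categoryByDoc, hits);
        System.out.println("java: " + counts.get("java"));
        System.out.println("search: " + counts.get("search"));
    }
}
```

A tokenized text field with many terms per document does not fit this layout, so the approach would not apply directly to the text field in the question.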
  • Karl Wettin at Oct 15, 2009 at 1:16 pm

    On Oct 14, 2009, at 3:15 PM, Grant Ingersoll wrote:
    > On Oct 12, 2009, at 10:46 PM, Thomas D'Silva wrote:
    >
    >> I am trying to compute the counts of terms of the documents
    >> returned by running a query using a TermVectorMapper.
    >> I was wondering if anyone knew if there was a faster way to do this
    >> rather than using a HashMap with a TermVectorMapper to store the
    >> counts of the terms and calling getTermFreqVector().
    >> I do not require the term frequency within a document.
    >
    > I think that is as fast as it's going to get unless you have some
    > other restrictions that would allow you to use a FieldCache.

    Just thinking out loud here... How about extending the Query/Scorer
    and doing some counting while executing the Query?


    karl
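
The idea of counting while the query runs, rather than collecting the top hits first and looping over them afterwards, could look roughly like the sketch below. It is plain Java: `HitCallback` and `forEachMatch` are hypothetical stand-ins for whatever per-hit hook the search API offers, not Lucene classes, and the per-document term lists stand in for term vectors.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class CountWhileMatching {
    // Hypothetical hook invoked once per matching document id.
    interface HitCallback {
        void collect(int docId);
    }

    // Stand-in for query execution: reports every document carrying the tag.
    static void forEachMatch(String[] tagByDoc, String tag, HitCallback callback) {
        for (int docId = 0; docId < tagByDoc.length; docId++) {
            if (tag.equals(tagByDoc[docId])) {
                callback.collect(docId);
            }
        }
    }

    // Counts, per term, the matching documents that contain it.
    static Map<String, Integer> countTerms(final List<List<String>> termsByDoc,
                                           String[] tagByDoc, String tag) {
        final Map<String, Integer> counts = new HashMap<String, Integer>();
        forEachMatch(tagByDoc, tag, new HitCallback() {
            public void collect(int docId) {
                // Count as each hit arrives; no ranked hit list is built.
                for (String term : termsByDoc.get(docId)) {
                    Integer old = counts.get(term);
                    counts.put(term, old == null ? 1 : old + 1);
                }
            }
        });
        return counts;
    }

    public static void main(String[] args) {
        List<List<String>> termsByDoc = Arrays.asList(
                Arrays.asList("lucene", "index"),
                Arrays.asList("solr", "index"),
                Arrays.asList("lucene", "query"));
        String[] tagByDoc = {"java", "web", "java"};
        Map<String, Integer> counts = countTerms(termsByDoc, tagByDoc, "java");
        System.out.println("lucene: " + counts.get("lucene"));
    }
}
```

The sketch skips scoring and sorting of hits, which the counting task does not need; it still has to read the terms of every matching document.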

  • Thomas D'Silva at Oct 15, 2009 at 2:04 pm
    Grant,

    I have an index with documents that have a text field containing
    document text, and a tag field containing tags associated with the
    document. I am trying to calculate the probability that a document
    contains a particular word and is tagged with a particular tag.
    This is related to a MoreLikeThis extension I was trying to write
    (http://issues.apache.org/jira/browse/LUCENE-1910)

    Most of the time is spent in the loop iterating over the documents
    tagged with the particular tag and computing counts of terms across
    those documents. If the index contains millions of documents, it takes
    a while to compute the document, tag probabilities.

    Thanks,
    Thomas
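
Read as a joint probability, the quantity described above is P(word, tag) = (number of documents that contain the word and carry the tag) / (total number of documents). For a single (word, tag) pair, one way to get the numerator without reading term vectors is to intersect the two ascending lists of document ids. The sketch below is plain Java with hypothetical names (`JointProbability`, `intersectionSize`) and made-up document-id lists; it does not use the Lucene API, and treating the probability as this simple ratio is an assumption.

```java
final class JointProbability {
    // Size of the intersection of two ascending doc-id arrays (linear merge).
    static int intersectionSize(int[] a, int[] b) {
        int i = 0, j = 0, n = 0;
        while (i < a.length && j < b.length) {
            if (a[i] < b[j]) {
                i++;
            } else if (a[i] > b[j]) {
                j++;
            } else {            // same document id in both lists
                n++;
                i++;
                j++;
            }
        }
        return n;
    }

    // Fraction of all documents that contain the word and carry the tag.
    static double jointProbability(int[] wordDocs, int[] tagDocs, int numDocs) {
        return (double) intersectionSize(wordDocs, tagDocs) / numDocs;
    }

    public static void main(String[] args) {
        int[] wordDocs = {1, 3, 4, 7, 9};   // docs containing the word
        int[] tagDocs = {0, 3, 7, 8};       // docs carrying the tag
        // Docs 3 and 7 appear in both lists, out of 10 documents.
        System.out.println(jointProbability(wordDocs, tagDocs, 10)); // prints 0.2
    }
}
```

This helps when only a few (word, tag) pairs are needed; computing counts for every term under a tag still requires visiting the terms of each tagged document.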


Discussion Overview
group: java-user @
categories: lucene
posted: Oct 13, '09 at 2:47a
active: Oct 15, '09 at 2:04p
posts: 4
users: 3
website: lucene.apache.org
