Hello:

I am new to LUCENE and I am trying out a few things with it. I can retrieve
the number of documents which satisfy a query, but I can't find how to
obtain the number of term occurrences which match it.

For example, if I search for the word "house", I want to obtain the
number of times the word occurs (not the number of documents).

Is it possible to do it in LUCENE?

Thanks in advance,

Mario Barcala


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org

  • Dipesh at Nov 13, 2008 at 2:27 am
    Yes, it's quite possible.
    1. Create the Term you want to search for, e.g.:
    Term term = new Term("yourfield", "yourword");

    2. Then get a TermDocs enumeration for that term.
    TermDocs provides an interface for enumerating <document, frequency> pairs
    for a term.

    TermDocs td = new
    FilterIndexReader(IndexReader.open("yourindex")).termDocs(term);

    3. Iterate through the documents that contain the term and sum the
    per-document frequencies.
    int count = 0;
    while (td.next()) {
        count += td.freq();  // occurrences of the term in the current document
    }
    td.close();

    Hope it helped,
    Regards,
    Dipesh
    On Thu, Nov 13, 2008 at 4:30 AM, Fco. Mario Barcala Rodríguez wrote:

    For example, if I search for the word "house", I want to obtain the
    number of times the word occurs (not the number of documents).


    --
    ----------------------------------------
    "Help Ever Hurt Never"- Baba
  • Lbarcala at Nov 13, 2008 at 8:36 am

    This helps, but what about combining this with a search criterion? I mean
    obtaining the number of times the term "house" occurs in documents dated
    between 1999 and 2005 (the year is another field of the documents). I
    can't find anything related in the classes you used.
  • Otis Gospodnetic at Nov 13, 2008 at 3:03 pm
    Mario,

    Does this help:
    http://hudson.zones.apache.org/hudson/job/Lucene-trunk/javadoc//org/apache/lucene/index/TermFreqVector.html

    Plus:
    http://hudson.zones.apache.org/hudson/job/Lucene-trunk/javadoc//org/apache/lucene/index/IndexReader.html#method_summary
    (look for "getTermFreq...")

    Otis
    --
    Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch




    ________________________________
    From: "lbarcala@freeresearch.org" <lbarcala@freeresearch.org>
    To: java-user@lucene.apache.org
    Sent: Thursday, November 13, 2008 3:35:24 AM
    Subject: Re: About counting term hits
  • Lbarcala at Nov 13, 2008 at 4:36 pm

    So, if I understand the solution, the main steps to do what I propose are:

    1) Obtain the documents which match the query (documents which include
    the word "house").
    2) Loop through the matching documents, using the IndexReader to
    obtain their term frequencies.
    3) Obtain from each TermFreqVector the frequency of the Term ("house") and
    add them up to calculate the result.

    And if it is a very frequent query and there are many documents (> 10,000),
    would LUCENE solve it in a reasonable time? A query might match several
    hundred documents.

    Thank you,

    Mario Barcala
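
    The three steps above can be sketched roughly as follows. This is plain
    Java, not the Lucene API, so that it runs without the library:
    DocTermVector is a hypothetical stand-in for a per-document term vector
    (sorted terms with parallel frequencies), and the list of matching
    documents is assumed to come from a search that has already been run.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-in for a per-document term vector: sorted, unique terms
// with parallel frequencies. This is NOT Lucene's TermFreqVector class.
final class DocTermVector {
    private final String[] terms;  // sorted ascending, no duplicates
    private final int[] freqs;     // freqs[i] is the frequency of terms[i]

    DocTermVector(String[] sortedTerms, int[] freqs) {
        this.terms = sortedTerms;
        this.freqs = freqs;
    }

    // Frequency of the term in this document, or 0 if the term is absent.
    int freqOf(String term) {
        int i = Arrays.binarySearch(terms, term);
        return i >= 0 ? freqs[i] : 0;
    }
}

final class TermVectorCount {
    // Steps 2 and 3: visit every matching document and add up the
    // frequency of the term recorded in that document's vector.
    static long totalOccurrences(List<DocTermVector> matchingDocs, String term) {
        long total = 0;
        for (DocTermVector doc : matchingDocs) {
            total += doc.freqOf(term);
        }
        return total;
    }
}
```

    The loop does one vector lookup per matching document, so the work grows
    with the number of hits rather than with the size of the whole index.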
  • Otis Gospodnetic at Nov 13, 2008 at 11:13 pm
    The more Documents you have to look at the slower it will be, but it may still be fast enough - it's impossible to tell without considering index size, query volume, hardware, number of hits/Docs, etc.


    Otis
    --
    Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch




  • Michael McCandless at Nov 14, 2008 at 3:50 pm
    I think to do this efficiently you'd need to modify Lucene's builtin
    query classes (eg TermQuery) such that during the scoring process, in
    addition to simply computing its contribution to the document's score,
    it would also record further information like total number of
    occurrences of each term, which docs had which terms, etc.

    I don't think there's a simple efficient way to do this with Lucene
    today, though if your result sets are small enough term vectors might
    be fine.

    Mike

  • Chris Hostetter at Nov 16, 2008 at 1:37 am
    : I think to do this efficiently you'd need to modify Lucene's builtin query
    : classes (eg TermQuery) such that during the scoring process, in addition to
    : simply computing its contribution to the document's score, it would also
    : record further information like total number of occurrences of each term,
    : which docs had which terms, etc.

    Unless you don't care about the score at all, just the total count, in
    which case you can use a custom Similarity to make the score for a doc be
    the count (ignore idf, norms, queryNorm, etc.). Then use a hit collector
    that sums the counts for every doc matched.
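
    A toy model of that idea, in plain Java rather than against the Lucene
    classes: ScoreFactors is a hypothetical, much-reduced stand-in for the
    hooks a Similarity provides, and SummingCollector stands in for a hit
    collector. With every factor except tf forced to 1, each doc's score is
    just the term's frequency, so summing the scores gives the total count.

```java
// Hypothetical, simplified scoring hooks; Lucene's Similarity has more methods.
interface ScoreFactors {
    float tf(int freq);
    float idf(int docFreq, int numDocs);
    float norm(int fieldLength);
}

// Neutralizes everything except the raw term frequency.
final class CountOnlyFactors implements ScoreFactors {
    public float tf(int freq) { return freq; }
    public float idf(int docFreq, int numDocs) { return 1f; }
    public float norm(int fieldLength) { return 1f; }
}

// Stand-in for a hit collector: receives (doc, score) and keeps a running sum.
final class SummingCollector {
    private double total;

    void collect(int doc, float score) { total += score; }

    long total() { return Math.round(total); }
}

final class CountByScore {
    // docs, freqs and lengths are parallel arrays describing the matching docs.
    static long run(int[] docs, int[] freqs, int[] lengths, int numDocs, ScoreFactors f) {
        SummingCollector collector = new SummingCollector();
        float idf = f.idf(docs.length, numDocs);
        for (int i = 0; i < docs.length; i++) {
            float score = f.tf(freqs[i]) * idf * f.norm(lengths[i]);
            collector.collect(docs[i], score);
        }
        return collector.total();
    }
}
```

    Whether the real scoring path can be flattened this cleanly depends on the
    Lucene version and on how norms were written at index time, so treat this
    only as an illustration of the arithmetic.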

    That should be as efficient as possible (it's certainly only one pass), but
    you might be able to optimize it by using your other criteria
    (date range or whatever) in a Filter to generate a BitSet, then fetching a
    TermDocs instance for your term and iterating through the docs, summing up
    the frequencies (you can use skipTo(set.nextSetBit()) to optimize away
    non-matching docs).

    (At least I'm pretty sure that would work.)
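
    A minimal sketch of that BitSet-plus-skipTo loop, again in plain Java so
    it runs without Lucene. Postings is a hypothetical stand-in for a TermDocs
    enumeration (ascending doc ids with a frequency each), and its skipTo here
    simply moves to the first entry at or after the target, which may differ
    in detail from the library's contract.

```java
import java.util.BitSet;

// Hypothetical stand-in for a TermDocs enumeration over one term.
final class Postings {
    private final int[] docs;   // ascending document ids
    private final int[] freqs;  // freqs[i] is the term frequency in docs[i]
    private int pos = 0;

    Postings(int[] ascendingDocs, int[] freqs) {
        this.docs = ascendingDocs;
        this.freqs = freqs;
    }

    // Move to the first entry whose doc id is >= target; false when exhausted.
    boolean skipTo(int target) {
        while (pos < docs.length && docs[pos] < target) {
            pos++;
        }
        return pos < docs.length;
    }

    int doc() { return docs[pos]; }

    int freq() { return freqs[pos]; }
}

final class FilteredTermCount {
    // Sums the term's frequency over the docs that are set in `allowed`.
    static long countFiltered(Postings postings, BitSet allowed) {
        long total = 0;
        int candidate = allowed.nextSetBit(0);
        while (candidate >= 0 && postings.skipTo(candidate)) {
            int doc = postings.doc();
            if (doc == candidate) {
                // Doc passes the filter and contains the term: count it.
                total += postings.freq();
                candidate = allowed.nextSetBit(doc + 1);
            } else {
                // The postings jumped past the candidate; catch the filter up.
                candidate = allowed.nextSetBit(doc);
            }
        }
        return total;
    }
}
```

    The two cursors leapfrog each other, so docs that fail the filter and
    docs that lack the term are both skipped without looking at their
    frequencies.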

    Another nuance to this question...

    : > > > > I am new to LUCENE and I am testing some issues about it. I can
    : > > > > retrieve
    : > > > > the number of documents which satisfies a query, but I don't find how
    : > > > > to
    : > > > > obtain the number of terms which match it.

    The words "term" and "query" mean very specific, and independent, things
    in Lucene, but Mario seems to be using them interchangeably. If you
    want to know how often a Term appears in all docs matching some
    criteria, then all of the techniques described so far should work.

    But if you want to count the occurrences of a more complicated Query (like:
    how many times does the phrase "Mario Barcala" appear in docs from
    1999-2003), the situation gets more complicated. For that you would want
    to use something like a SpanQuery and iterate through the Spans (counting
    them as you go).
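
    To illustrate why a phrase needs positions rather than plain frequencies,
    here is a small plain-Java sketch (not the SpanQuery API). It assumes you
    already have, for one document, the ascending token positions of each of
    the two words; the class and method names are made up for the example.

```java
final class PhraseOccurrences {
    // Counts the places where a token from `first` is immediately followed
    // by a token from `second`. Both arrays hold ascending token positions
    // within a single document.
    static int countAdjacent(int[] first, int[] second) {
        int count = 0;
        int i = 0;
        int j = 0;
        while (i < first.length && j < second.length) {
            int wanted = first[i] + 1;  // where the second word must sit
            if (second[j] < wanted) {
                j++;                    // second word is too early, move on
            } else {
                if (second[j] == wanted) {
                    count++;            // exact adjacency: one phrase match
                }
                i++;                    // try the next occurrence of the first word
            }
        }
        return count;
    }
}
```

    The total for a restricted set of documents would then be the sum of this
    per-document count over the docs that pass the date criterion.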



    -Hoss



Discussion Overview
group: java-user
categories: lucene
posted: Nov 12, '08 at 7:31p
active: Nov 16, '08 at 1:37a
posts: 8
users: 5
website: lucene.apache.org
