Hi there,

I have 25 indexes of 1.8GB each, read through a MultiReader.
I am trying to get the document frequency of every term in specific documents,
and it takes quite a long time: a document with 1000 terms takes around
4:30 minutes to compute the document frequencies of all its terms, and
there are longer documents than that.

Since I have quite a lot of documents to process (around 12000), it will take
forever.
My function for getting the document frequency is listed below (it handles a
single term, but it is called for every term in the document's term vector).

public int getdocumentfrequency(String termstr) throws Exception
{
    Term term = new Term("contents", termstr);
    TermEnum termenum = multireader.terms(term);
    int freq = termenum.docFreq();
    return freq;
}

Is there a better (i.e. faster) way to get all the document frequencies of a
specific document?

thanks a lot,
Nir.

--
View this message in context: http://www.nabble.com/docFreq-takes-long-time-to-execute-in-a-multiple-index-environment-tf4221604.html#a12009334
Sent from the Lucene - Java Users mailing list archive at Nabble.com.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


  • Daniel Naber at Aug 6, 2007 at 7:10 am

    On Monday 06 August 2007 01:40, tierecke wrote:

    > Term term=new Term("contents", termstr);
    > TermEnum termenum=multireader.terms(term);
    > int freq=termenum.docFreq();

    IndexReader has a docFreq() method, no need to get a Term enumeration.

    regards
    Daniel

    --
    http://www.danielnaber.de

  • Tierecke at Aug 6, 2007 at 11:13 am
    Thanks Daniel, you are completely right.
    I changed the code, but that does not make it [noticeably faster]; probably
    docFreq() runs over the enum behind the scenes anyway.
    I added a hash table that keeps the docFreq values already read, so if I
    meet a term again in another document I can retrieve its value quickly. Is
    there another solution? Maybe a separate Lucene index for this? (In that
    case, can I read from and write to the same index without closing and
    reopening it? I want to read from it and, if I don't find the docFreq there,
    calculate it and put it in the index.)
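
The hash-table idea described above could be sketched roughly as follows. This is only an illustration, not the poster's actual code: `DocFreqCache` and its `lookup` function are hypothetical names, and `lookup` stands in for whatever slow call reads the frequency from the index.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.ToIntFunction;

/** Memoizes document frequencies so each distinct term hits the index at most once. */
final class DocFreqCache {
    private final Map<String, Integer> cache = new HashMap<>();
    // Hypothetical stand-in for the slow index call (not a Lucene type).
    private final ToIntFunction<String> lookup;
    private int indexLookups = 0;

    DocFreqCache(ToIntFunction<String> lookup) {
        this.lookup = lookup;
    }

    /** Returns the cached frequency, consulting the index only on a cache miss. */
    int docFreq(String term) {
        Integer cached = cache.get(term);
        if (cached != null) {
            return cached;
        }
        int freq = lookup.applyAsInt(term);
        indexLookups++;
        cache.put(term, freq);
        return freq;
    }

    /** Number of times the underlying index was actually consulted. */
    int indexLookups() {
        return indexLookups;
    }
}
```

In the setup from this thread, `lookup` would presumably wrap `multireader.docFreq(new Term("contents", termstr))`, converting its checked `IOException` as needed. A cache like this only helps for terms that recur across documents; the first lookup of each term is as slow as before.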

    10x Nir.

  • Tierecke at Aug 6, 2007 at 12:51 pm
    Does Lucene allow searching and indexing simultaneously?

    Yes. However, an IndexReader only searches the index as of the "point in
    time" that it was opened. Any updates to the index, either added or deleted
    documents, will not be visible until the IndexReader is re-opened. So your
    application must periodically re-open its IndexReaders to see the latest
    updates. The IndexReader.isCurrent() method allows you to test whether
    any updates have occurred to the index since your IndexReader was opened.

    [from Lucene FAQ]

    Still, I need to speed up this process.


  • Paul Elschot at Aug 6, 2007 at 4:02 pm
    Nir,

    You can speed this up (maybe a lot) by moving the disk head(s)
    as little as possible.

    Have a look at the file formats of Lucene to get the idea.

    In your outer loop iterate over the readers of the multireader.
    For each reader iterate over the terms in sorted order.
    And don't access the index in any other way while doing this,
    that is, do no query searches and no updates.

    With a bit of bookkeeping per term, it is then straightforward
    to compute the total document frequencies.
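
One way to read the loop order suggested above is sketched below. It is an illustration under stated assumptions, not code from this thread: `SubIndex` is a hypothetical stand-in for a single per-index reader, `TotalDocFreq` is an invented name, and the sketch assumes the index's term order matches Java's natural `String` ordering and that each document lives in exactly one index.

```java
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

/** Hypothetical stand-in for one sub-index; not a Lucene type. */
interface SubIndex {
    int docFreq(String term);
}

final class TotalDocFreq {
    /**
     * Sums per-index document frequencies for the given terms.
     * The outer loop is over indexes and the inner loop visits terms in
     * sorted order, so lookups against a real index would move forward
     * through its term dictionary instead of jumping around.
     */
    static Map<String, Integer> compute(List<SubIndex> indexes, Set<String> terms) {
        TreeSet<String> sorted = new TreeSet<>(terms);
        Map<String, Integer> totals = new TreeMap<>();
        for (SubIndex index : indexes) {
            for (String term : sorted) {
                // Per-term bookkeeping: accumulate this index's count into the total.
                totals.merge(term, index.docFreq(term), Integer::sum);
            }
        }
        return totals;
    }
}
```

To benefit across many documents, `terms` would be the union of the terms of all documents being processed, so that each index is walked once rather than once per document.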

    Regards,
    Paul Elschot


  • Testn at Aug 7, 2007 at 12:57 am
    Does that mean you already reuse the IndexReader without reopening it? If
    you haven't done so, please try it. docFreq() should be really quick.



Discussion Overview
group: java-user
category: lucene
posted: Aug 5, '07 at 11:41p
active: Aug 7, '07 at 12:57a
posts: 6
users: 4
website: lucene.apache.org
