FAQ
Hi,

I am facing some problems using Lucene. The index I am using is
constructed like this:

try {
    Analyzer analyzer = new SnowballAnalyzer(Version.LUCENE_30, "English");
    Directory dir = MMapDirectory.open(index);
    IndexWriter writer = new IndexWriter(dir, analyzer,
            MaxFieldLength.LIMITED);
    searcher = new IndexSearcher(dir);

    Document luceneDocument;
    int numClusters = clustering.getClusterCount();
    String[] clusterLabels = clustering.getClusterLabels();
    for (int cluId = 0; cluId != numClusters; ++cluId) {
        int[] docIds = clustering.getItemsOfCluster(cluId);
        for (int docId : docIds) {
            luceneDocument = new Document();
            luceneDocument.add(new NumericField("id", Field.Store.YES,
                    true).setIntValue(docId));
            luceneDocument.add(new NumericField("cluster_id", Field.Store.YES,
                    true).setIntValue(cluId));
            luceneDocument.add(new Field(
                    "plaintext", texts.get(docId),
                    Field.Store.NO,
                    Field.Index.ANALYZED,
                    Field.TermVector.YES));
            luceneDocument.add(new Field(
                    "label", clusterLabels[cluId],
                    Field.Store.YES,
                    Field.Index.ANALYZED,
                    Field.TermVector.YES));
            writer.addDocument(luceneDocument);
        }
    }

    writer.optimize();
    writer.close();

} catch (IOException e) {
    e.printStackTrace();
}

Then, while the Java application is running, the speed of Lucene is good. I
can sift through about 11,000 categories in a few minutes. However, if I
restart the application and read in the previously created Lucene index
instead of generating a new one via:

try {
    Directory dir = MMapDirectory.open(index);
    searcher = new IndexSearcher(dir);
} catch (CorruptIndexException e) {
    e.printStackTrace();
} catch (IOException e) {
    e.printStackTrace();
}

Now, only about 10 categories are examined within a few minutes instead of
11,000 categories like before. So my question is why access to Lucene is so
slow in the second case. A typical query looks like this:

BooleanQuery booleanQuery = new BooleanQuery();
Term luceneTerm = new Term(PLAINTEXT, stemmer.process(candidate));
TermQuery termQuery = new TermQuery(luceneTerm);
booleanQuery.add(termQuery, BooleanClause.Occur.MUST);
NumericRangeQuery<Integer> lTerm =
        NumericRangeQuery.newIntRange(CLUSTER_ID, clusterId, clusterId, true, true);
booleanQuery.add(lTerm, BooleanClause.Occur.MUST);
TopDocs resultSet = queryIndex(searcher, booleanQuery);

Thank you!
--
View this message in context: http://lucene.472066.n3.nabble.com/IndexSearch-very-slow-after-reopening-the-index-tp1699711p1699711.html
Sent from the Lucene - Java Users mailing list archive at Nabble.com.

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org

  • Ian Lea at Oct 14, 2010 at 9:34 am
    Do the fast searches that you get while the app is running use the
    searcher you create before you add all the docs to the index? Surely
    that won't see the added docs.
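
    The point about a searcher opened too early can be pictured with a toy
    analogy in plain Java. This is not Lucene code: `Store` and `openReader`
    are hypothetical names, and the sketch only assumes that a reader
    behaves like a point-in-time copy of whatever was in the store when the
    reader was opened.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Toy analogy only, not Lucene code: a reader that captures the state
// of a store at the moment it is opened.
class SnapshotDemo {

    static class Store {
        private final List<String> docs = new ArrayList<>();

        void add(String doc) {
            docs.add(doc);
        }

        // Returns a point-in-time copy; later add() calls do not affect it.
        List<String> openReader() {
            return Collections.unmodifiableList(new ArrayList<>(docs));
        }
    }

    public static void main(String[] args) {
        Store store = new Store();
        List<String> early = store.openReader();  // opened before any documents exist
        store.add("doc-1");
        store.add("doc-2");
        List<String> late = store.openReader();   // opened after the writes

        System.out.println("early reader sees " + early.size() + " docs");
        System.out.println("late reader sees " + late.size() + " docs");
    }
}
```

    In this sketch the early reader reports 0 documents and the late one
    reports 2, which is why searches against an early reader can look fast
    while matching little or nothing.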

    There are general tips on speeding up searches at
    http://wiki.apache.org/lucene-java/ImproveSearchingSpeed. There are
    some gotchas with MMapDirectory depending on your OS and whether you
    are 32 or 64 bit - see the javadocs. What are you running? What
    happens when you use a standard disk based directory rather than MMap?
    How many docs are you adding? How big is the index? What version of
    lucene are you using?

    Your NumericRangeQuery doesn't look much like a range but I doubt
    that's the problem.

    Finally, you could run a profiler to see where the time is being spent.


    --
    Ian.

  • Subwayne at Oct 14, 2010 at 10:04 am
    Hi Ian,

    Thank you for your quick response. I am running Lucene on Ubuntu 10.04, 64
    bit. I switched from MMapDirectory to NIOFSDirectory without any significant
    changes in performance. The Lucene version running is 3.0.2. I followed your
    advice and opened the IndexSearcher after I added all documents to the
    index. Now, the search performance is slow, too.

    I am indexing a large part of the Open Directory Project (ODP):
    specifically, 60,000 categories with about 1,400,000 documents. Each
    document is represented by the short description made available by ODP.
    The resulting index on disk is rather small: 408 MB.

    Regarding the NumericRangeQuery: it's true that I am only interested in an
    exact match. However, I read on a forum that NumericRangeQuery can also be
    used for exact matches without any performance loss. Nevertheless, I would
    like to learn how such an exact match can be done more efficiently.

    While searching with Lucene, I issue about 100 queries like the one in my
    first post for each category.

    Last but not least, I installed VisualVM as an Eclipse Plugin. But I am not
    familiar with it and need to play with it a little bit.


  • Pradeep Singh at Oct 14, 2010 at 10:28 am
    Many times when you run a search for the first time it has to load all field
    values IF the field is being sorted on. Subsequent searches use that cache
    and are faster. Does that happen in your case? From your description it
    doesn't look like you are sorting, although this kind of performance
    degradation would usually happen on first time sorts.
  • Ian Lea at Oct 14, 2010 at 10:30 am
    OK, so it looks like we're down to a more general "why is searching
    slow" question.

    The number of docs is not very large by lucene standards.

    Work through http://wiki.apache.org/lucene-java/ImproveSearchingSpeed.

    If that still doesn't help, pick a slow query and post again with:

    . the output of query.toString()
    . the number of docs matched
    . how long it took first time on new searcher, and second and third
    and ... tenth
    . ummm... anything else you think might be relevant.
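
    The timings asked for in the list above could be collected with a small
    harness along these lines. It is only a sketch: `QueryTiming` and
    `timeRuns` are hypothetical names, and the `Runnable` is a stand-in for
    the real search call, which would have to be wrapped so that its checked
    exceptions are handled.

```java
// Hypothetical helper, not part of Lucene: times repeated runs of one search call.
class QueryTiming {

    // Runs `search` the given number of times and returns each
    // wall-clock duration in milliseconds, in execution order.
    static double[] timeRuns(Runnable search, int runs) {
        double[] millis = new double[runs];
        for (int i = 0; i < runs; i++) {
            long start = System.nanoTime();
            search.run();
            millis[i] = (System.nanoTime() - start) / 1_000_000.0;
        }
        return millis;
    }

    public static void main(String[] args) {
        // Stand-in workload; replace the body with the real search
        // on a freshly opened searcher.
        Runnable search = () -> {
            long sum = 0;
            for (int i = 0; i < 1_000_000; i++) {
                sum += i % 7;
            }
            if (sum < 0) {
                throw new IllegalStateException("unreachable");
            }
        };

        double[] millis = timeRuns(search, 10);
        for (int i = 0; i < millis.length; i++) {
            System.out.printf("run %d: %.3f ms%n", i + 1, millis[i]);
        }
    }
}
```

    Reporting every run separately, rather than an average, keeps any one-off
    cost of the first search on a new searcher visible.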

    Good luck.


    --
    Ian.

  • David Clarke at Oct 14, 2010 at 12:40 pm
    Hey Guys

    Whenever I try to view open issues in Hudson it doesn't display any information.
    Does anyone know why this is the case or how I could fix it?
    Thanks in advance
    -Dave Clarke
  • Subwayne at Oct 14, 2010 at 1:31 pm
    OK, I read the wiki page on improving search speed and adopted some of
    the advice. The slow queries are simple ones. Here are some:

    plaintext:guid
    107.0 ms
    resultSet.totalHits = 1

    plaintext:allianc
    51.0 ms
    resultSet.totalHits = 1

    plaintext:engin
    46.0 ms
    resultSet.totalHits = 1

    plaintext:servicetec
    46.0 ms
    resultSet.totalHits = 1

    ... and so on. I issue about one hundred queries for each category. For
    this, I retrieve the documents of a category with the help of a query
    filter:

    Term luceneTerm = new Term("plaintext", stemmer.process(candidate));
    TermQuery termQuery = new TermQuery(luceneTerm);
    Filter qf = new CachingWrapperFilter(new QueryWrapperFilter(termQuery));

    TopDocs resultSet = searcher.search(lTerm, qf, Integer.MAX_VALUE);

    Each subsequent query takes "only" 46 ms instead of 107 ms. However, I
    think that is still very slow. Note that these values were measured with
    the Lucene index held in RAM (RAMDirectory). It makes no difference to the
    timings whether I use RAMDirectory or NIOFSDirectory.
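
    For scale, here is a rough estimate built only from figures quoted in
    this thread: about 100 queries per category, about 46 ms per query, and
    60,000 categories. `SearchCostEstimate` and `totalSeconds` are
    hypothetical names, and the sketch assumes every query costs the same and
    that queries run one after another.

```java
// Back-of-the-envelope estimate using the figures quoted in this thread.
class SearchCostEstimate {

    // Total time in seconds for `categories` categories, each issuing
    // `queriesPerCategory` queries that take `msPerQuery` milliseconds.
    static double totalSeconds(long categories, long queriesPerCategory, double msPerQuery) {
        return categories * queriesPerCategory * msPerQuery / 1000.0;
    }

    public static void main(String[] args) {
        double perCategory = totalSeconds(1, 100, 46.0);   // 4.6 seconds
        double all = totalSeconds(60_000, 100, 46.0);      // 276,000 seconds

        System.out.printf("per category: %.1f s%n", perCategory);
        System.out.printf("all categories: %.0f s (about %.1f hours)%n", all, all / 3600.0);
    }
}
```

    Under these assumptions one category costs about 4.6 seconds and the full
    set about 276,000 seconds, or roughly 77 hours, so the per-query time
    dominates everything else.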

    Thanks for any further advice.

  • Ian Lea at Oct 15, 2010 at 9:03 am
    I'm a bit confused about what exactly you are timing. Is the 46 ms
    for one search on one term with one hit, or for 100 similar searches
    or what?

    Perhaps a minimal self-contained search program demonstrating exactly
    what you are doing would help, with evidence of where it is spending
    time.


    --
    Ian.


Discussion Overview
group: java-user
categories: lucene
posted: Oct 14, '10 at 9:07a
active: Oct 15, '10 at 9:03a
posts: 8
users: 4
website: lucene.apache.org
