I'm running a medium-sized web search with an index just shy of 9GB containing 800,000 docs.

We are using Lucene version 2.9.0 (we have not checked yet to see if this applies to older versions as well).

Looking at my logs, I'm finding that phrase queries are especially slow. In our index, we do not remove stopwords, so things like "the" and "is" are getting indexed on purpose.

If I try a phrase search like "The The", it takes about 10 seconds in Luke to get some results back, and a bit less on subsequent runs (7 sec).

More complete phrases that match maybe only 1 document can also take >10 secs if they have many stopwords in them.

I was wondering if this is normal behavior, considering that we do not remove stopwords?

Also, on some phrase queries (not all), the difference between the first call and any subsequent calls can be very big. For example, it could take 5 seconds to do one query and then less than 1 second to perform it again.
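
One way to put numbers on the first-call versus repeat-call gap described above is to run the same query several times in one process and record each elapsed time. The following is a minimal sketch, not part of the original post: the class name `RepeatTiming`, the method `timeRuns`, and the stand-in workload in `main` are all hypothetical, and the `Runnable` would wrap whatever search call is being measured.

```java
class RepeatTiming {
    /**
     * Runs the task the given number of times and returns the elapsed
     * wall-clock time of each run in milliseconds, in run order.
     */
    static long[] timeRuns(Runnable task, int runs) {
        long[] millis = new long[runs];
        for (int i = 0; i < runs; i++) {
            long start = System.nanoTime();
            task.run();
            millis[i] = (System.nanoTime() - start) / 1_000_000L;
        }
        return millis;
    }

    public static void main(String[] args) {
        // Hypothetical stand-in workload; replace the body with the real
        // phrase search (wrapping any checked exceptions it throws).
        Runnable fakeSearch = () -> {
            long sum = 0;
            for (int i = 0; i < 1_000_000; i++) {
                sum += i;
            }
            if (sum < 0) {
                throw new IllegalStateException("unreachable");
            }
        };
        long[] times = timeRuns(fakeSearch, 3);
        for (int i = 0; i < times.length; i++) {
            System.out.println("run " + (i + 1) + ": " + times[i] + " ms");
        }
    }
}
```

The harness only measures; it cannot say which cache is responsible. A reasonable follow-up, assuming the index reader can be reopened between runs, is to compare a repeat run on a fresh reader with the very first run: if the repeat is still fast, the speedup is more likely coming from the operating system's file cache than from anything held by the reader.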

Does Lucene, by default, cache anything when a (phrase) query is made or is this simply file system caching at work?

If this is normal behavior, I assume that the solution is either to remove stopwords from the index or to shard it and search the shards with ParallelMultiSearcher.

What do you think?

Daniel Shane




---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org

  • Michael McCandless at Mar 19, 2010 at 11:14 pm
    Nutch/Solr's CommonGrams is the right way to solve this. It combines
    frequent terms (e.g. stopwords) with adjacent terms. So "the wizard of
    oz" will be indexed e.g. as the_wizard wizard_of of_oz. It'll require a
    full re-index though, and you have to fix up searching so that the same
    term expansion works.

    Lucene does no caching that'd speed up PhraseQuery (unless that was
    the very first query against the field since you opened the reader) --
    this is probably the OS's IO cache.

    Mike
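
The term combination described in the reply above can be sketched in plain Java. This is a simplified illustration only, not the Nutch/Solr CommonGrams code: the class `CommonGramsSketch`, the method `expand`, and the underscore separator are assumptions chosen to reproduce the "the wizard of oz" example, and the real filters may behave differently (for instance, they may also keep the single terms at index time).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class CommonGramsSketch {
    /**
     * Joins every adjacent pair of tokens in which at least one token is a
     * common word. Tokens that take part in no such pair are kept as-is.
     */
    static List<String> expand(List<String> tokens, Set<String> common) {
        List<String> out = new ArrayList<>();
        for (int i = 0; i < tokens.size(); i++) {
            boolean pairsWithPrev = i > 0 && isCommonPair(tokens, i - 1, common);
            boolean pairsWithNext = i + 1 < tokens.size() && isCommonPair(tokens, i, common);
            if (!pairsWithPrev && !pairsWithNext) {
                out.add(tokens.get(i)); // plain term, untouched
            }
            if (pairsWithNext) {
                out.add(tokens.get(i) + "_" + tokens.get(i + 1));
            }
        }
        return out;
    }

    /** True if the token at index left or the one right after it is common. */
    private static boolean isCommonPair(List<String> tokens, int left, Set<String> common) {
        return common.contains(tokens.get(left)) || common.contains(tokens.get(left + 1));
    }

    public static void main(String[] args) {
        Set<String> common = new HashSet<>(Arrays.asList("the", "of", "is"));
        // prints [the_wizard, wizard_of, of_oz]
        System.out.println(expand(Arrays.asList("the", "wizard", "of", "oz"), common));
        // prints [the_the]
        System.out.println(expand(Arrays.asList("the", "the"), common));
    }
}
```

Under this scheme a phrase such as "The The" turns into a lookup of the single, much rarer term the_the instead of a position-by-position match on one of the most frequent words in the index. The same expansion has to be applied to the query text, which is why the reply mentions both a full re-index and a change on the search side.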
  • Daniel Shane at Mar 22, 2010 at 2:48 pm
    Indeed!

    I found a very good article on this as well at:

    http://www.hathitrust.org/blogs/large-scale-search/slow-queries-and-common-words-part-1

    It really sums up what you are saying.

    Thanks for the help!
    Daniel Shane

  • Rafael Turk at Mar 30, 2010 at 2:25 am
    unsubscribe


Discussion Overview
group: java-user
category: lucene
posted: Mar 19, '10 at 8:57p
active: Mar 30, '10 at 2:25a
posts: 4
users: 3
website: lucene.apache.org