FAQ
Hi!

Actually FuzzyQuery.rewrite() is pretty expensive so why not introduce a
caching decorator? A WeakHashMap with key==IndexReader and value==LRU of
BooleanQueries.

Timo

---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

Search Discussions

  • Chris Hostetter at Dec 14, 2007 at 11:17 pm
    : Actually FuzzyQuery.rewrite() is pretty expensive so why not introduce a
    : caching decorator? A WeakHashMap with key==IndexReader and value==LRU of
    : BooleanQueries.

    Applications are certainly welcome to do this (there is nothing to stop
    you from calling rewrite before passing the query to your Searcher, i
    believe the overhead of calling rewrite on a query that's already been
    rewritten is fairly low) but I don't think it would be a good idea to add
    something like this to the core ...for starters we are trying to move
    away from "hidden" caches like this that are not transparent (and
    controllable) but the users because they have the potential to eat up a
    lot of ram. But also: he amount of time needed to rewrite the query is
    probably not vastly more expensive then the anout of time to execute the
    search .. you might as well cache the entire result keyed off of the
    orriginal query (and not just the rewritten query object).


    -Hoss


    ---------------------------------------------------------------------
    To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
    For additional commands, e-mail: java-dev-help@lucene.apache.org
  • Timo Nentwig at Dec 15, 2007 at 7:23 pm

    On Saturday 15 December 2007 00:17:10 Chris Hostetter wrote:
    : Actually FuzzyQuery.rewrite() is pretty expensive so why not introduce a
    : caching decorator? A WeakHashMap with key==IndexReader and value==LRU of
    : BooleanQueries.

    Applications are certainly welcome to do this (there is nothing to stop
    you from calling rewrite before passing the query to your Searcher, i
    believe the overhead of calling rewrite on a query that's already been
    rewritten is fairly low) but I don't think it would be a good idea to add
    Why should subsequent rewrites be faster? The query is being rewritten every
    time over and over again. Even *if* you can profit by buffered IO you sill
    have a plenty of string levenshtein OPs.

    I'm against caching in general because you always run into some hard to
    understand and examine problem but this seems to be one of the rare cases
    where caching makes sense.

    I attached a small test app, the index contains 2.2 million docs and 5 million
    terms, I search for a pretty common term which was rewritten to 15 terms and
    hit roughly 4.000 docs (I also tried a term that was rewritten to 1 term and
    hit about 300 docs, made no difference):

    rewritten in 809
    Overall search time: 842
    rewritten in 271
    Overall search time: 274
    rewritten in 216
    Overall search time: 219
    rewritten in 180
    Overall search time: 182
    rewritten in 184
    Overall search time: 186
    rewritten in 220
    Overall search time: 226
    rewritten in 207
    Overall search time: 208
    rewritten in 181
    Overall search time: 183
    rewritten in 183
    Overall search time: 185
    rewritten in 180
    Overall search time: 181

    $ vmstat -S M
    procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu----
    r b swpd free buff cache si so bi bo in cs us sy id wa
    0 0 757 298 56 384 0 0 21 36 39 9 5 1 94 0
    something like this to the core ...for starters we are trying to move
    away from "hidden" caches like this that are not transparent (and
    Well, at least the existing of such an decorator (which you explicitly have to
    use) will give you a hint that this is performance hot spot. I took me quite
    some time to figure it out...
    controllable) but the users because they have the potential to eat up a
    lot of ram. But also: he amount of time needed to rewrite the query is
    probably not vastly more expensive then the anout of time to execute the
    search .. you might as well cache the entire result keyed off of the
    orriginal query (and not just the rewritten query object).


    -Hoss


    ---------------------------------------------------------------------
    To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
    For additional commands, e-mail: java-dev-help@lucene.apache.org
  • Yonik Seeley at Dec 15, 2007 at 7:49 pm

    On Dec 15, 2007 2:23 PM, Timo Nentwig wrote:
    On Saturday 15 December 2007 00:17:10 Chris Hostetter wrote:
    : Actually FuzzyQuery.rewrite() is pretty expensive so why not introduce a
    : caching decorator? A WeakHashMap with key==IndexReader and value==LRU of
    : BooleanQueries.

    Applications are certainly welcome to do this (there is nothing to stop
    you from calling rewrite before passing the query to your Searcher, i
    believe the overhead of calling rewrite on a query that's already been
    rewritten is fairly low) but I don't think it would be a good idea to add
    Why should subsequent rewrites be faster?
    Hoss means calling rewrite on the *result* of a rewrite.
    So the application would call rewrite, cache the resulting query, and
    then use that already rewritten query from then on. Lucene wouldn't
    know it had already been rewritten, and would call rewrite again, but
    it should be fast.

    -Yonik

    ---------------------------------------------------------------------
    To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
    For additional commands, e-mail: java-dev-help@lucene.apache.org
  • Timo Nentwig at Dec 15, 2007 at 9:27 pm

    On Saturday 15 December 2007 20:48:38 Yonik Seeley wrote:
    On Dec 15, 2007 2:23 PM, Timo Nentwig wrote:
    On Saturday 15 December 2007 00:17:10 Chris Hostetter wrote:
    : Actually FuzzyQuery.rewrite() is pretty expensive so why not
    : introduce a caching decorator? A WeakHashMap with key==IndexReader
    : and value==LRU of BooleanQueries.

    Applications are certainly welcome to do this (there is nothing to stop
    you from calling rewrite before passing the query to your Searcher, i
    believe the overhead of calling rewrite on a query that's already been
    rewritten is fairly low) but I don't think it would be a good idea to
    add
    Why should subsequent rewrites be faster?
    Hoss means calling rewrite on the *result* of a rewrite.
    Uh? That's what I mean (propose), too... But currently nothing's cached at
    all.

    Cache the result (BooleanQuery) of rewrite() in a WeakHashMap with key =
    IndexReader and value = LRU.
    So the application would call rewrite, cache the resulting query, and
    then use that already rewritten query from then on. Lucene wouldn't
    know it had already been rewritten, and would call rewrite again, but
    it should be fast.

    -Yonik

    ---------------------------------------------------------------------
    To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
    For additional commands, e-mail: java-dev-help@lucene.apache.org


    ---------------------------------------------------------------------
    To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
    For additional commands, e-mail: java-dev-help@lucene.apache.org
  • Chris Hostetter at Dec 16, 2007 at 5:22 pm
    : > Hoss means calling rewrite on the *result* of a rewrite.
    :
    : Uh? That's what I mean (propose), too... But currently nothing's cached at
    : all.
    :
    : Cache the result (BooleanQuery) of rewrite() in a WeakHashMap with key =
    : IndexReader and value = LRU.

    yes .. you can do that. you don't need anything in the core lucene
    Searchers to do it for you (ie: nothing needs refactored out, because
    redundent rewrite calls are cheap) ... in the paragraph below,
    "application" refers to your code, "Lucene" refers to the code in the
    lucene JAR...

    : > So the application would call rewrite, cache the resulting query, and
    : > then use that already rewritten query from then on. Lucene wouldn't
    : > know it had already been rewritten, and would call rewrite again, but
    : > it should be fast.



    -Hoss


    ---------------------------------------------------------------------
    To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
    For additional commands, e-mail: java-dev-help@lucene.apache.org
  • Chris Hostetter at Dec 16, 2007 at 5:42 pm
    : I'm against caching in general because you always run into some hard to
    : understand and examine problem but this seems to be one of the rare cases
    : where caching makes sense.

    For you maybe, but i've actually seen very few instances in practice where
    people who are speed concious use FuzzyQueries (or other queries with
    really expensive rewrite methods) to a large degree .... in the few casees
    people do:
    1) they usually go whole hog and cache search results
    (ie: query+sort => list of docs)
    2) they want direct control of the cache.

    : Well, at least the existing of such an decorator (which you explicitly have to
    : use) will give you a hint that this is performance hot spot. I took me quite
    : some time to figure it out...

    i would argue that for a lot of people something like "Solr" is that
    decorator ... but if you'd like to submit a contrib that does something
    like this with a nice generic API by all means please do.



    -Hoss


    ---------------------------------------------------------------------
    To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
    For additional commands, e-mail: java-dev-help@lucene.apache.org
  • Timo Nentwig at Sep 6, 2008 at 6:56 pm

    On Tue, 11 Dec 2007, Timo Nentwig wrote:

    Date: Tue, 11 Dec 2007 13:27:59 +0100
    From: Timo Nentwig <lucene@nitwit.de>
    Reply-To: java-dev@lucene.apache.org
    To: java-dev@lucene.apache.org
    Subject: Caching FuzzyQuery

    Hi!

    Actually FuzzyQuery.rewrite() is pretty expensive so why not introduce a
    caching decorator? A WeakHashMap with key==IndexReader and value==LRU of
    BooleanQueries.
    So, didn't become a popular approach/topic. Anyway, I wrote an aspect and
    it does the job for a couple of months now, so I'd like to share it.

    HTH
    Timo


    In order to enable runtime AspectJ weaving start your JVM with

    -javaagent:./foo/bar/lib/aspectjweaver.jar

    and place the following file called "aop.xml" in META-INF.

    <aspectj>
    <aspects>
    <aspect name="foo.bar.FuzzyQueryCachingAspect" />
    </aspects>

    <weaver options="-verbose -showWeaveInfo">
    <include within="org.apache.lucene.search.*" />
    <include within="org.apache.lucene.index.*" />
    </weaver>
    </aspectj>


    @Aspect
    public class FuzzyQueryCachingAspect
    {
    private static final Log LOG = LogFactory.getLog(
    FuzzyQueryCachingAspect.class );
    // your cache impl goes here
    private static final Caching CACHE = Caching.instance(
    Caching.FUZZY );

    @Before( "execution(public void
    org.apache.lucene.index.IndexReader+.close())" )
    public void flush( final JoinPoint jp ) throws Throwable
    {
    assert jp.getThis() == jp.getTarget();

    final IndexReader r = (IndexReader)jp.getThis();
    final Directory d = r.directory();

    final String group = name( d );

    CACHE.remove( group );
    LOG.info( "Removed from " + CACHE + ": " + group );
    }

    @Around( "execution(public org.apache.lucene.search.Query
    org.apache.lucene.search.FuzzyQuery.rewrite(org.apache.lucene.index.IndexReader))
    && args(reader)" )
    public Object cache( final IndexReader reader, final
    ProceedingJoinPoint jp ) throws Throwable
    {
    assert jp.getThis() == jp.getTarget();

    final FuzzyQuery q = (FuzzyQuery)jp.getThis();
    final String term = q.getTerm().text();

    final String group = name( reader.directory() );
    final String key = term + "~" + q.getMinSimilarity();

    final Object e = CACHE.get( group, key );
    if( e != null )
    {
    LOG.debug( term + " found in cache" );
    return e;
    }

    final long t0 = System.currentTimeMillis();
    final Query fuzzy = (Query)jp.proceed();
    CACHE.put( group, key, fuzzy );
    LOG.info( term + " rewritten in " +
    (System.currentTimeMillis() - t0) + "ms (" + group + ")" );

    return fuzzy;
    }

    private static final String name( final Directory d )
    {
    if( d instanceof FSDirectory ) return
    ((FSDirectory)d).getFile().getAbsolutePath();

    return String.valueOf( d.hashCode() );
    }
    }


    ---------------------------------------------------------------------
    To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
    For additional commands, e-mail: java-dev-help@lucene.apache.org

Related Discussions

Discussion Navigation
viewthread | post
Discussion Overview
groupdev @
categorieslucene
postedDec 11, '07 at 12:28p
activeSep 6, '08 at 6:56p
posts8
users3
websitelucene.apache.org

People

Translate

site design / logo © 2021 Grokbase