Hi

I read CachingWrapperFilter (CWF) today and initially I thought it was going to
cache a Filter in-memory for me, so that I can more efficiently use it for
subsequent searches. But I learned that all it does is cache the DocIdSet
returned by the wrapped Filter.

This is good in and of itself, but I wonder if we shouldn't go the extra
mile and wrap stuff in memory for Filters which don't operate from memory.
For example - I have a Filter which reads information from a Payload as it's
iterated on, so it doesn't keep anything in memory (it's per-user
information, so I haven't decided yet if I can afford caching it in-memory
and whether it will be beneficial). Caching that sort of Filter by CWF will
obviously not improve anything.

I'm not sure what to do here:
1. Just reflect that in the javadoc (it is very confusing saying "Wraps
another filter's result and caches it", which is not true)
2. Introduce a class which takes a Filter and loads it into memory (I think
I read an issue/discussion about this), to an OpenBitSet for example (but we
need to know the number of results in advance, or grow the array as we go
along).
3. Don't use CWF, write a "load-a-Filter-into-in-memory-Filter" utility, and
cache the Filters w/ the user as Key.
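
Option 2 can be sketched without knowing the number of results in advance. What follows is a minimal, self-contained illustration and not Lucene code: `DocIterator` is a hypothetical stand-in for a doc-id iterator, `over` is a made-up helper that fakes an iterator from an array, and `java.util.BitSet` (which grows on demand) is used in place of OpenBitSet.

```java
import java.util.BitSet;

/** Sketch: drain a one-pass doc-id iterator into an in-memory bit set. */
final class InMemoryFilterSketch {

    /** Hypothetical stand-in for a doc-id iterator; NOT Lucene's DocIdSetIterator. */
    interface DocIterator {
        int NO_MORE_DOCS = Integer.MAX_VALUE;

        /** Returns the next matching doc id, or NO_MORE_DOCS when exhausted. */
        int nextDoc();
    }

    /** Illustration helper: iterates over a sorted array of doc ids. */
    static DocIterator over(final int... sortedDocs) {
        return new DocIterator() {
            private int pos = 0;

            public int nextDoc() {
                return pos < sortedDocs.length ? sortedDocs[pos++] : NO_MORE_DOCS;
            }
        };
    }

    /** Reads every doc id exactly once; the bit set grows as higher ids arrive. */
    static BitSet load(DocIterator it) {
        BitSet bits = new BitSet();
        for (int doc = it.nextDoc(); doc != DocIterator.NO_MORE_DOCS; doc = it.nextDoc()) {
            bits.set(doc);
        }
        return bits;
    }
}
```

Once loaded, the bit set can be consulted repeatedly without touching the original (possibly payload-backed) iterator again; the price is memory proportional to the highest doc id seen.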

I will probably need to do the second part of (3) anyway, so I'm asking
whether such a utility would be useful to have in Lucene, and whether there's
already one (I thought I read somewhere about the ability to execute a Query
and get back a Filter, or use the results as a Filter)? I looked at
QueryWrapperFilter, but it doesn't seem to give me what I need, since its
getDocIdSet method returns an iterator which is the Scorer of the Query that
it wraps.
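
The second part of option (3), caching loaded filters with the user as the key, might look roughly like the sketch below. Everything in it is an assumption made for illustration: `Loader` is a hypothetical callback that produces a user's bits, the `loads` counter exists only to make cache behaviour visible, and the least-recently-used bound is one possible answer to the memory concern, not something the thread prescribes.

```java
import java.util.BitSet;
import java.util.LinkedHashMap;
import java.util.Map;

/** Sketch: bounded per-user cache of in-memory filter bits (LRU eviction). */
final class PerUserFilterCacheSketch {

    /** Hypothetical callback that builds the bits for one user. */
    interface Loader {
        BitSet load(String user);
    }

    private final Map<String, BitSet> cache;
    private final Loader loader;
    int loads = 0; // how often the loader ran; for illustration only

    PerUserFilterCacheSketch(final int maxUsers, Loader loader) {
        this.loader = loader;
        // An access-ordered LinkedHashMap drops the least recently used entry.
        this.cache = new LinkedHashMap<String, BitSet>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<String, BitSet> eldest) {
                return size() > maxUsers;
            }
        };
    }

    /** Returns the cached bits for the user, loading them on a miss. */
    synchronized BitSet get(String user) {
        BitSet bits = cache.get(user);
        if (bits == null) {
            bits = loader.load(user);
            loads++;
            cache.put(user, bits);
        }
        return bits;
    }
}
```

Whether this pays off depends on how many users are active at once and how large each bit set is, which is exactly the trade-off left open above.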

Anyway, I think the documentation of CWF should be fixed and made clearer.

Any thoughts?

Shai

  • Paul Elschot at Jun 2, 2009 at 6:35 pm

    On Tuesday 02 June 2009 16:39:06 Shai Erera wrote:
    > Hi
    >
    > I read CachingWrapperFilter (CWF) today and initially I thought it was going to
    > cache a Filter in-memory for me, so that I can more efficiently use it for
    > subsequent searches. But I learned that all it does is cache the DocIdSet
    > returned by the wrapped Filter.
    >
    > This is good in and of itself, but I wonder if we shouldn't go the extra
    > mile and wrap stuff in memory for Filters which don't operate from memory.

    It was good until QueryWrapperFilter returned a Scorer instead of a disi
    (DocIdSetIterator) based on an (Open)BitSet.

    > For example - I have a Filter which reads information from a Payload as it's
    > iterated on, so it doesn't keep anything in memory (it's per-user
    > information, so I haven't decided yet if I can afford caching it in-memory
    > and whether it will be beneficial). Caching that sort of Filter by CWF will
    > obviously not improve anything.
    >
    > I'm not sure what to do here:
    > 1. Just reflect that in the javadoc (it is very confusing saying "Wraps
    > another filter's result and caches it", which is not true)
    > 2. Introduce a class which takes a Filter and loads it into memory (I think
    > I read an issue/discussion about this), to an OpenBitSet for example (but we
    > need to know the number of results in advance, or grow the array as we go
    > along).
    > 3. Don't use CWF, write a "load-a-Filter-into-in-memory-Filter" utility, and
    > cache the Filters w/ the user as Key.

    For that, one could subclass CWF and override the docIdSetToCache method
    to return an OpenBitSetDISI constructed from the given disi.

    > I will probably need to do the second part of (3) anyway, so I'm asking
    > whether such a utility would be useful to have in Lucene, and whether there's
    > already one (I thought I read somewhere about the ability to execute a Query
    > and get back a Filter, or use the results as a Filter)?

    That is what QueryWrapperFilter does.

    > I looked at
    > QueryWrapperFilter, but it doesn't seem to give me what I need, since its
    > getDocIdSet method returns an iterator which is the Scorer of the Query that
    > it wraps.

    The Scorer seems to be what you need, but there are cheaper disis, see below.

    > Anyway, I think the documentation of CWF should be fixed and made clearer.
    >
    > Any thoughts?

    The basic problem is that disis from DocIdSets come in two variations: expensive
    ones, e.g. based on a query, and cheap ones, e.g. based on an OpenBitSet or on
    a SortedVIntList.
    One would normally want to cache a DocIdSet that provides a cheap disi.

    For the javadocs of the current CWF it could be sufficient to mention more
    prominently that the default CWF caches the given DocIdSet, basically
    assuming that its disi is cheap.

    But it might be a good idea to change the default implementation to check
    whether the given DocIdSet is an OpenBitSet, cache it directly in that
    case, and otherwise build and cache an OpenBitSetDISI from its disi.

    Regards,
    Paul Elschot
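
The check described in the message above (keep a set that is already bit-based, materialize anything else before caching) can be illustrated with stand-in types. This is only a sketch under stated assumptions: `DocSet`, `DocIterator`, `BitDocSet` and the `expensive` helper are hypothetical names invented here, `java.util.BitSet` replaces OpenBitSet and OpenBitSetDISI, and the static `docSetToCache` merely mirrors the role of the CWF method mentioned above without using Lucene's actual signature.

```java
import java.util.BitSet;

/** Sketch: decide what to cache depending on whether the set is already cheap. */
final class CacheableDocSetSketch {

    /** Hypothetical stand-ins; NOT Lucene's DocIdSetIterator / DocIdSet. */
    interface DocIterator {
        int NO_MORE_DOCS = Integer.MAX_VALUE;
        int nextDoc();
    }

    interface DocSet {
        DocIterator iterator();
    }

    /** Cheap set: iteration only walks bits already held in memory. */
    static final class BitDocSet implements DocSet {
        final BitSet bits;

        BitDocSet(BitSet bits) {
            this.bits = bits;
        }

        public DocIterator iterator() {
            return new DocIterator() {
                private int from = 0; // next index to search from

                public int nextDoc() {
                    int doc = bits.nextSetBit(from);
                    if (doc < 0) {
                        return NO_MORE_DOCS;
                    }
                    from = doc + 1;
                    return doc;
                }
            };
        }
    }

    /** Illustration helper: a set whose iterator we pretend is costly to re-run. */
    static DocSet expensive(final int... sortedDocs) {
        return new DocSet() {
            public DocIterator iterator() {
                return new DocIterator() {
                    private int pos = 0;

                    public int nextDoc() {
                        return pos < sortedDocs.length ? sortedDocs[pos++] : NO_MORE_DOCS;
                    }
                };
            }
        };
    }

    /** Returns the set itself if it is bit-based, otherwise a materialized copy. */
    static DocSet docSetToCache(DocSet set) {
        if (set instanceof BitDocSet) {
            return set; // already cheap to iterate, cache as-is
        }
        BitSet bits = new BitSet();
        DocIterator it = set.iterator();
        for (int doc = it.nextDoc(); doc != DocIterator.NO_MORE_DOCS; doc = it.nextDoc()) {
            bits.set(doc);
        }
        return new BitDocSet(bits);
    }
}
```

Because a bit-based set is returned unchanged, calling the method on an already cached result costs nothing extra, while a Scorer-like set is walked exactly once.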
  • Shai Erera at Jun 3, 2009 at 8:04 am
    Thanks Paul!

    I'll work on such a utility (one that takes a Filter and reads it into an
    OpenBitSet or SortedVIntList) and then post back in case you're interested
    in adopting it and changing CWF to use it, or something else.

    Shai

  • Michael McCandless at Jun 9, 2009 at 4:50 pm
    I think, once we can efficiently apply cheap random-access DocIdSets
    the way deleted docs are applied (i.e., distribute them down to all
    SegmentTermDocs), it'd be useful for this filter manager to also
    pre-fold deletes in, such that SegmentTermDocs would only have a
    single random-access DocIdSet to check.

    Mike
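
The idea of pre-folding deletes can be shown on plain bit sets. This is a sketch only, assuming both the cached filter and the deletions are available as `java.util.BitSet` instances; `foldDeletes` and `deletedDocs` are hypothetical names, and how Lucene actually stores or applies deleted docs is not modeled here.

```java
import java.util.BitSet;

/** Sketch: combine filter bits and deletions into one set to check per doc. */
final class FoldDeletesSketch {

    /** Returns a new bit set of docs that pass the filter and are not deleted. */
    static BitSet foldDeletes(BitSet filterBits, BitSet deletedDocs) {
        BitSet accepted = (BitSet) filterBits.clone(); // leave the cached filter untouched
        accepted.andNot(deletedDocs);                  // clear every deleted doc
        return accepted;
    }
}
```

After folding, a single `accepted.get(doc)` answers both questions at once; the folded set would have to be rebuilt whenever the deletions change.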


Discussion Overview
group: java-dev
categories: lucene
posted: Jun 2, '09 at 2:39p
active: Jun 9, '09 at 4:50p
posts: 4
users: 3
website: lucene.apache.org
