I thought of a possible enhancement - before I go down the road, I am
looking for some input from the community.

Currently, the QueryFilter caches the bits based upon the IndexReader.

The problem with this is that small incremental changes to the index
invalidate the cache.

What if instead the filter determined that the underlying IndexReader
was a MultiReader and then maintained a bitset for each sub-reader,
combining them in bits() when requested. The filter could check whether
any of the underlying readers were different (removed or added)
and then create a new bitset only for that reader. With the new
non-BitSet filter implementations this could be even more memory
efficient, since the bitsets would not need to be combined into a
single bitset.

With the previous work on "reopen" so that segments are reused, this
would allow filters to be far more useful in a highly interactive
environment.
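In rough pseudo-Java, the caching scheme might look something like this (a toy sketch, not actual Lucene code: Segment stands in for SegmentReader, and marking even-numbered documents stands in for running the wrapped query against a single segment):

```java
import java.util.BitSet;
import java.util.IdentityHashMap;
import java.util.List;
import java.util.Map;

// Toy model of the proposal: cache one BitSet per sub-reader (segment) and
// recompute bits only for segments that changed across a reopen.
class Segment {
    final int maxDoc;                      // number of documents in this segment
    Segment(int maxDoc) { this.maxDoc = maxDoc; }
}

class PerSegmentFilter {
    // Keyed by segment identity: a segment reused across a "reopen"
    // keeps its cached bits.
    private final Map<Segment, BitSet> cache = new IdentityHashMap<>();
    int computes = 0;                      // counts actual recomputations

    // Stand-in for running the underlying query on one segment:
    // here we simply mark the even-numbered documents.
    private BitSet compute(Segment seg) {
        computes++;
        BitSet b = new BitSet(seg.maxDoc);
        for (int i = 0; i < seg.maxDoc; i += 2) b.set(i);
        return b;
    }

    // bits() over a multi-reader: concatenate the per-segment bitsets,
    // offset by the running doc base, recomputing only cache misses.
    BitSet bits(List<Segment> segments) {
        BitSet result = new BitSet();
        int docBase = 0;
        for (Segment seg : segments) {
            BitSet b = cache.computeIfAbsent(seg, this::compute);
            for (int i = b.nextSetBit(0); i >= 0; i = b.nextSetBit(i + 1))
                result.set(docBase + i);
            docBase += seg.maxDoc;
        }
        return result;
    }
}
```

Only a changed segment pays the recomputation cost; unchanged segments hit the cache.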

What do you think?






---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org


  • Chris Hostetter at Jul 7, 2006 at 7:48 pm
    I'm no segments/MultiReader expert, but your idea sounds good to me ... it
    seems like it would certainly work in the "new segments" situation.

    One thing I don't see you mention is dealing with deletions ... I'm not
    sure if deleting documents causes the version number of an IndexReader to
    change or not (if it does, your job is easy), but even if it doesn't, I'm
    guessing you could say that if hasDeletions() returns true, you have to
    assume you need to invalidate your cached bits (worst case scenario, you
    are invalidating the cache as often as it is now).


    : Date: Fri, 7 Jul 2006 00:32:54 -0500
    : From: robert engels <rengels@ix.netcom.com>
    : Reply-To: java-dev@lucene.apache.org
    : To: Lucene-Dev <java-dev@lucene.apache.org>
    : Subject: MultiSegmentQueryFilter enhancement for interactive indexes?
    -Hoss


  • Robert engels at Jul 8, 2006 at 1:35 am
    I implemented it and it works great. I didn't worry about the
    deletions since, by the time a filter is used, the deleted documents
    have already been removed by the query. The only problem that arose out
    of this was for things like ConstantScoreQuery (which uses a filter) -
    I needed to modify this query to ignore deleted documents.

    Now I have incremental cached filters - the query performance is
    going through the roof.


  • Yonik Seeley at Jul 8, 2006 at 2:54 am
    This might be even better in conjunction with moving away from BitSet
    to some sort of interface like DocNrSkipper... that way you would
    never have to combine the filters into a single BitSet.


    -Yonik
    http://incubator.apache.org/solr Solr, the open-source Lucene search server
  • Robert engels at Jul 8, 2006 at 3:02 am
    Exactly. I have been watching to see how the new filter interface
    works out for 2.0. I am still not certain why it is so involved.

    I still think

    interface Filter {
        boolean include(int doc);
        int nextInclude(int doc);
    }

    should suffice.
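    For instance (illustrative only - assuming nextInclude(doc) returns the
    first included doc >= doc, or -1 when exhausted), an implementation
    backed by a sorted array of doc numbers:

```java
import java.util.Arrays;

// Illustrative implementation of the proposed two-method interface,
// backed by a sorted array of matching document numbers. The caller
// gets both random and sequential access without seeing the representation.
class SortedDocsFilter {
    private final int[] docs;              // sorted, distinct doc numbers

    SortedDocsFilter(int[] sortedDocs) { this.docs = sortedDocs; }

    // Random access: is this document included?
    boolean include(int doc) {
        return Arrays.binarySearch(docs, doc) >= 0;
    }

    // Sequential access: first included doc >= doc, or -1 if none.
    int nextInclude(int doc) {
        int i = Arrays.binarySearch(docs, doc);
        if (i < 0) i = -i - 1;             // insertion point
        return i < docs.length ? docs[i] : -1;
    }
}
```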
  • Yonik Seeley at Jul 8, 2006 at 3:11 am

    It depends on how the interface will be used and the capabilities of
    the underlying implementation.
    Paul's sorted vint list is good on space, but it doesn't do random access well.
    Solr's HashDocSet offers very fast random access, but it doesn't do
    sequential access.
    A BitSet does both, but it's always big.
    There are many other possible implementations & tradeoffs, but you get
    the idea...
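    As an illustration of the space/sequential-access tradeoff, here is a
    simplified sketch (not Paul's actual code) of a delta + VInt encoded
    sorted doc list - compact when the gaps are small, but readable only by
    a forward scan, which is why random access suffers:

```java
import java.io.ByteArrayOutputStream;

// Sketch of the tradeoff: delta-encode a sorted doc list, then store each
// delta as a VInt (7 bits per byte, high bit = "more bytes follow").
// Decoding is strictly sequential, so there is no cheap random access.
class VIntDocList {
    private final byte[] bytes;

    VIntDocList(int[] sortedDocs) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        int prev = 0;
        for (int doc : sortedDocs) {
            int delta = doc - prev;        // small gaps -> few bytes
            prev = doc;
            while ((delta & ~0x7F) != 0) { // emit continuation bytes
                out.write((delta & 0x7F) | 0x80);
                delta >>>= 7;
            }
            out.write(delta);              // final byte, high bit clear
        }
        bytes = out.toByteArray();
    }

    // Forward scan: first stored doc >= target, or -1. We cannot jump
    // into the middle of the encoding.
    int next(int target) {
        int pos = 0, doc = 0;
        while (pos < bytes.length) {
            int delta = 0, shift = 0, b;
            do {
                b = bytes[pos++] & 0xFF;
                delta |= (b & 0x7F) << shift;
                shift += 7;
            } while ((b & 0x80) != 0);
            doc += delta;
            if (doc >= target) return doc;
        }
        return -1;
    }

    int sizeInBytes() { return bytes.length; }
}
```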

    -Yonik
    http://incubator.apache.org/solr Solr, the open-source Lucene search server

  • Robert engels at Jul 8, 2006 at 3:44 am
    Agreed. The interface I proposed supports both sequential and random
    access to the filter - hiding the implementation.

  • Paul Elschot at Jul 8, 2006 at 7:25 am

    For query searching, random access to a Filter is only needed
    in the forward direction, e.g. by nextInclude(docNr) or skipTo(docNr).

    As for why it's so involved:

    Making a "rewritten" Filter work more like a Scorer has the advantage
    that combinations of filters can (also) be evaluated using the same
    mechanisms as currently existing for Scorers. For this, some additions
    to the existing code will be needed, like adding an
    add(Filter, BooleanClause.Occur) to BooleanQuery, and a similar
    addition of a Matcher (proposed superclass of Scorer to "rewrite" a
    Filter to) to some of the underlying scorers.
    Such occurrences of filters are only "must" and "must not", "should"
    doesn't make sense because there is no score value.

    Also, it makes sense to have an explain() method for a "rewritten"
    Filter, because it can be used when searching a query.
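    A rough sketch of the shape being described (names follow this thread;
    the actual LUCENE-584 classes may differ), with a trivial list-backed
    Matcher added for illustration:

```java
// The proposed Matcher: a Scorer stripped of everything that deals with
// score values. match(MatchCollector) drives top-level evaluation;
// doc()/next()/skipTo() support nested evaluation inside other scorers.
interface MatchCollector {
    void collect(int doc);
}

abstract class Matcher {
    public abstract int doc();                  // current matching doc
    public abstract boolean next();             // advance to the next match
    public abstract boolean skipTo(int target); // advance to first match >= target

    // Top-level evaluation: feed every match to the collector.
    public void match(MatchCollector collector) {
        while (next()) {
            collector.collect(doc());
        }
    }
}

// Trivial Matcher over a fixed sorted doc list, for illustration only.
class ListMatcher extends Matcher {
    private final int[] docs;
    private int i = -1;
    ListMatcher(int[] docs) { this.docs = docs; }
    public int doc() { return docs[i]; }
    public boolean next() { return ++i < docs.length; }
    public boolean skipTo(int target) {
        while (i < 0 || docs[i] < target) {
            if (!next()) return false;
        }
        return true;
    }
}
```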

    Regards,
    Paul Elschot

  • Robert engels at Jul 8, 2006 at 12:15 pm
    Is that really necessary for a filter? It seems that a filter implies
    efficiency over "scoring", and that filters should be able to be
    evaluated in a chained (or priority-queue) fashion fairly efficiently
    without any need for "rewrites".

    With the new incremental updates of a filter (based upon a query) it
    seems that the newly proposed filtering could be far less efficient.

    I think a filter change that just removes the BitSet dependency is
    all that is needed, and anything else is overkill, but I admit I am
    probably missing something here.

    If these changes will eventually allow for efficient filtering based
    upon non-indexed stored fields I am all for it.
  • Robert engels at Jul 8, 2006 at 12:21 pm
    Attached is the incremental updating QueryFilter. The MyMultiReader
    class is a basic extension of MultiReader that allows access to the
    underlying IndexReaders.

    Obviously this also requires that "reopen" be implemented so that
    SegmentReaders for unchanged segments remain the same across the reopen.
  • Paul Elschot at Jul 9, 2006 at 3:40 am
    Robert,

    Thanks for your questions, things are beginning to fall into place
    (see http://issues.apache.org/jira/browse/LUCENE-584):
    On Saturday 08 July 2006 14:14, robert engels wrote:
    : Is that really necessary for a filter? It seems that a filter implies
    : efficiency over a "scoring", and that filters should be able to be
    The proposed Matcher is a superclass of Scorer formed by leaving
    out all the methods dealing with score values.
    : evaluated in a chained (or priority queue) fashion fairly efficiently
    The current DisjunctionSumScorer has a priority queue. I have to say
    that I did not yet consider a filter clause to a boolean query that is
    based on a disjunction of filters: in this case the "should" occurrence
    makes sense, but calling it a query is overdoing it; the disjunction
    would be a Filter itself.

    In principle, it is possible to evaluate a disjunction over filters during a
    query search, and it might even make sense when the disjunction is
    skipTo'd into infrequently as one of the required clauses in a boolean query.
    I have no idea whether this would be useful in practice.

    Also, in the same way as for top-level disjunction queries, for filters
    there are more efficient methods of dealing with a top-level disjunction
    than a priority queue; see for example RangeFilter, which collects
    all matching docs in a BitSet by iterating the TermScorers in the range
    one by one.

    The distinction between top-level evaluation and nested evaluation is
    in the proposed Matcher: it has a match(MatchCollector) method for the top
    level, and doc(), next() and skipTo() can be used for nested evaluation.
    The same distinction exists in Scorer: score(HitCollector, ...) and roughly
    the rest.
    : without any need for "rewrites".
    Rewriting of a query is a way to make an association between
    a query and one or more index readers. The same association is currently
    present for a Filter in the bits(IndexReader) method, proposed
    to be deprecated.
    Perhaps the proposed getMatcher(IndexReader) method should
    be called Filter.rewrite(IndexReader), just as Query.rewrite(IndexReader).
    : With the new incremental updates of a filter (based upon a query) it
    : seems that the newly proposed filtering could be far less efficient.
    A Filter can be composed in the same way as an IndexReader can use
    multiple segments. Also, document deletion in a segment is currently done
    by a special-purpose bit set.
    For incremental updates, the "rewriting" of a filter could be limited to the
    filter component associated with the newly added segment(s).
    : I think a filter change that just removes the BitSet dependency is
    : all that is needed, and anything else is overkill, but I admit I am
    I thought so, too. But then I realized that there are many things shared
    between current Scorers and Filters. These things deal mostly
    with matching and not at all with scoring.
    : probably missing something here.
    Perhaps a method to provide a complete Explanation of why a document
    matches, or does not match, a filtered query?
    : If these changes will eventually allow for efficient filtering based
    : upon non-indexed stored fields I am all for it.
    For the non-indexed case, there is no choice but to read all stored data
    and evaluate a boolean function on the field of each document.
    I think the only efficiency to be gained there is in reading the stored
    fields, but IIRC that has been fixed.
    For the indexed case, a TermScorer is a Scorer, which is a proposed Matcher.
    The norms can already be left out, so the only things "left to be left out"
    are the term frequencies and positions. Once that is done, there is no
    more need to use a non-indexed stored field for filtering, because an
    indexed-only field would always be more efficient in indexed data size.

    Regards,
    Paul Elschot
  • Bruce Ritchie at Jul 11, 2006 at 3:32 am
    Robert,

    Can you quantify 'through the roof' a bit? Are the filters that you are
    creating that expensive to create, or is it the usage of BitSets that is
    the real cause of the performance improvement you've seen?


    Regards,

    Bruce Ritchie

  • Robert engels at Jul 11, 2006 at 3:43 am
    Creation of the filters is very expensive - it usually involves a large
    range query. We also convert all range and prefix queries to filters,
    since scoring these does not make sense to us...

    For example: show sales where the sale price was greater than 0 and less
    than 500k. Frequently the user will get too many results, and so he will
    then add another term (like the neighborhood).

    Having the "sale price" filter cached helps performance immensely.

    This is a bit of a contrived example (since sales are not updated
    very frequently).

    A better example comes from Nutch. In the query optimizer, terms with
    a 0 boost that occur in N percent of the documents are converted into
    a filter. Having to recreate this filter every time a document is added
    is very expensive. With this change there is no performance hit to
    using the filter optimization with highly interactive indices.

Discussion Overview
group: java-dev @ lucene.apache.org
category: lucene
posted: Jul 7, '06 at 5:33a
active: Jul 11, '06 at 3:43a
posts: 13
users: 5
