Grokbase Groups Lucene dev March 2009
FAQ
Refactoring Lucene collectors (HitCollector and extensions)
-----------------------------------------------------------

Key: LUCENE-1575
URL: https://issues.apache.org/jira/browse/LUCENE-1575
Project: Lucene - Java
Issue Type: Improvement
Components: Search
Reporter: Shai Erera
Fix For: 2.9


This issue is a result of a recent discussion we've had on the mailing list. You can read the thread [here|http://www.nabble.com/Is-TopDocCollector%27s-collect()-implementation-correct--td22557419.html].

We have agreed to do the following refactoring:
* Rename MultiReaderHitCollector to Collector, with the purpose that it will be the base class for all Collector implementations.
* Deprecate HitCollector in favor of the new Collector.
* Introduce new methods in IndexSearcher that accept Collector, and deprecate those that accept HitCollector.
** Create a final class HitCollectorWrapper, and use it in the deprecated methods in IndexSearcher, wrapping the given HitCollector.
** HitCollectorWrapper will be marked deprecated, so we can remove it in 3.0, when we remove HitCollector.
** It will remove any instanceof checks that currently exist in IndexSearcher code.
* Create a new (abstract) TopDocsCollector, which will:
** Leave collect and setNextReader unimplemented.
** Introduce protected members PriorityQueue and totalHits.
** Introduce a single protected constructor which accepts a PriorityQueue.
** Implement topDocs() and getTotalHits() using the PQ and totalHits members. These can be used as-are by extending classes, as well as be overridden.
** Introduce a new topDocs(start, howMany) method which will be used a convenience method when implementing a search application which allows paging through search results. It will also attempt to improve the memory allocation, by allocating a ScoreDoc[] of the requested size only.
* Change TopScoreDocCollector to extend TopDocsCollector, use the topDocs() and getTotalHits() implementations as they are from TopDocsCollector. The class will also be made final.
* Change TopFieldCollector to extend TopDocsCollector, and make the class final. Also implement topDocs(start, howMany).
* Change TopFieldDocCollector (deprecated) to extend TopDocsCollector, instead of TopScoreDocCollector. Implement topDocs(start, howMany)
* Review other places where HitCollector is used, such as in Scorer, deprecate those places and use Collector instead.

Additionally, the following proposal was made w.r.t. decoupling score from collect():
* Change collect to accecpt only a doc Id (unbased).
* Introduce a setScorer(Scorer) method.
* If during collect the implementation needs the score, it can call scorer.score().
If we do this, then we need to review all places in the code where collect(doc, score) is called, and assert whether Scorer can be passed. Also this raises few questions:
* What if during collect() Scorer is null? (i.e., not set) - is it even possible?
* I noticed that many (if not all) of the collect() implementations discard the document if its score is not greater than 0. Doesn't it mean that score is needed in collect() always?

Open issues:
* The name for Collector
* TopDocsCollector was mentioned on the thread as TopResultsCollector, but that was when we thought to call Colletor ResultsColletor. Since we decided (so far) on Collector, I think TopDocsCollector makes sense, because of its TopDocs output.
* Decoupling score from collect().

I will post a patch a bit later, as this is expected to be a very large patch. I will split it into 2: (1) code patch (2) test cases (moving to use Collector instead of HitCollector, as well as testing the new topDocs(start, howMany) method.
There might be even a 3rd patch which handles the setScorer thing in Collector (maybe even a different issue?)

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

Search Discussions

  • Michael McCandless (JIRA) at Mar 27, 2009 at 8:46 pm
    [ https://issues.apache.org/jira/browse/LUCENE-1575?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12690094#action_12690094 ]

    Michael McCandless commented on LUCENE-1575:
    --------------------------------------------

    Looks good! Thanks Shai. Some responses:

    bq. Create a final class HitCollectorWrapper, and use it in the deprecated methods in IndexSearcher, wrapping the given HitCollector.

    This turns deprecated HitCollector into a Collector? Seems like it
    should be package private?

    bq. Change TopFieldDocCollector (deprecated) to extend TopDocsCollector, instead of TopScoreDocCollector. Implement topDocs(start, howMany)

    This is deprecated, so we shouldn't add topDocs(start, howMany)? I
    think just switch it back to extending the deprecated TopDocCollector
    (like it does in 2.4)?

    bq. What if during collect() Scorer is null? (i.e., not set) - is it even possible?

    I think Lucene should guarantee not to do that?

    bq. I noticed that many (if not all) of the collect() implementations discard the document if its score is not greater than 0. Doesn't it mean that score is needed in collect() always?

    Hmmmm good point. I would love to stop screening for 0 score in the
    core collectors (like Solr). Maybe we fix the core collectors to not
    screen by zero score, but we add a new "only keep positive scores"
    collector chain/wrapper class that does the filtering and the forwards
    collection to another collector? This way there's a migration path if
    somehow users are relying on this.

    And we should note this difference clearly in the javadocs for the new
    hierarchy.

    bq. There might be even a 3rd patch which handles the setScorer thing in Collector (maybe even a different issue?)

    I think it's fine if it's the same issue, though doing it as 2 patches
    is going to make life difficult. I think a single patch covering
    changes to src/java, and one to src/test is OK, though I'd personally
    prefer just one patch overall.

    Refactoring Lucene collectors (HitCollector and extensions)
    -----------------------------------------------------------

    Key: LUCENE-1575
    URL: https://issues.apache.org/jira/browse/LUCENE-1575
    Project: Lucene - Java
    Issue Type: Improvement
    Components: Search
    Reporter: Shai Erera
    Fix For: 2.9


    This issue is a result of a recent discussion we've had on the mailing list. You can read the thread [here|http://www.nabble.com/Is-TopDocCollector%27s-collect()-implementation-correct--td22557419.html].
    We have agreed to do the following refactoring:
    * Rename MultiReaderHitCollector to Collector, with the purpose that it will be the base class for all Collector implementations.
    * Deprecate HitCollector in favor of the new Collector.
    * Introduce new methods in IndexSearcher that accept Collector, and deprecate those that accept HitCollector.
    ** Create a final class HitCollectorWrapper, and use it in the deprecated methods in IndexSearcher, wrapping the given HitCollector.
    ** HitCollectorWrapper will be marked deprecated, so we can remove it in 3.0, when we remove HitCollector.
    ** It will remove any instanceof checks that currently exist in IndexSearcher code.
    * Create a new (abstract) TopDocsCollector, which will:
    ** Leave collect and setNextReader unimplemented.
    ** Introduce protected members PriorityQueue and totalHits.
    ** Introduce a single protected constructor which accepts a PriorityQueue.
    ** Implement topDocs() and getTotalHits() using the PQ and totalHits members. These can be used as-are by extending classes, as well as be overridden.
    ** Introduce a new topDocs(start, howMany) method which will be used a convenience method when implementing a search application which allows paging through search results. It will also attempt to improve the memory allocation, by allocating a ScoreDoc[] of the requested size only.
    * Change TopScoreDocCollector to extend TopDocsCollector, use the topDocs() and getTotalHits() implementations as they are from TopDocsCollector. The class will also be made final.
    * Change TopFieldCollector to extend TopDocsCollector, and make the class final. Also implement topDocs(start, howMany).
    * Change TopFieldDocCollector (deprecated) to extend TopDocsCollector, instead of TopScoreDocCollector. Implement topDocs(start, howMany)
    * Review other places where HitCollector is used, such as in Scorer, deprecate those places and use Collector instead.
    Additionally, the following proposal was made w.r.t. decoupling score from collect():
    * Change collect to accecpt only a doc Id (unbased).
    * Introduce a setScorer(Scorer) method.
    * If during collect the implementation needs the score, it can call scorer.score().
    If we do this, then we need to review all places in the code where collect(doc, score) is called, and assert whether Scorer can be passed. Also this raises few questions:
    * What if during collect() Scorer is null? (i.e., not set) - is it even possible?
    * I noticed that many (if not all) of the collect() implementations discard the document if its score is not greater than 0. Doesn't it mean that score is needed in collect() always?
    Open issues:
    * The name for Collector
    * TopDocsCollector was mentioned on the thread as TopResultsCollector, but that was when we thought to call Colletor ResultsColletor. Since we decided (so far) on Collector, I think TopDocsCollector makes sense, because of its TopDocs output.
    * Decoupling score from collect().
    I will post a patch a bit later, as this is expected to be a very large patch. I will split it into 2: (1) code patch (2) test cases (moving to use Collector instead of HitCollector, as well as testing the new topDocs(start, howMany) method.
    There might be even a 3rd patch which handles the setScorer thing in Collector (maybe even a different issue?)
    --
    This message is automatically generated by JIRA.
    -
    You can reply to this email to add a comment to the issue online.


    ---------------------------------------------------------------------
    To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
    For additional commands, e-mail: java-dev-help@lucene.apache.org
  • Shai Erera (JIRA) at Mar 28, 2009 at 2:18 pm
    [ https://issues.apache.org/jira/browse/LUCENE-1575?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12693443#action_12693443 ]

    Shai Erera commented on LUCENE-1575:
    ------------------------------------

    {quote}
    This turns deprecated HitCollector into a Collector? Seems like it
    should be package private?
    {quote}

    Initially I wrote it but then deleted. I decided to make the decision as I create the patch. If this will be used only in IndexSearcher, then it should be a private static final class in IndexSearcher, otherwise a package private one. However, if it turns out we'd want to use it for now in other places too where we deprecate the HitCollector methods, then it will be public.
    Anyway, it will be marked deprecated, and I have the intention to make it as 'invisible' as possible.

    {quote}
    This is deprecated, so we shouldn't add topDocs(start, howMany)? I
    think just switch it back to extending the deprecated TopDocCollector
    (like it does in 2.4)?
    {quote}

    That's a good idea.

    {quote}
    Hmmmm good point. I would love to stop screening for 0 score in the
    core collectors (like Solr). Maybe we fix the core collectors to not
    screen by zero score, but we add a new "only keep positive scores"
    collector chain/wrapper class that does the filtering and the forwards
    collection to another collector? This way there's a migration path if
    somehow users are relying on this.
    {quote}

    I can do that. Create a FilterZeroScoresCollector which wraps a Collector and passes forward only documents with score > 0. BTW, how can a document get a zero score?

    I thought to split patches to code and test since I believe the code patch can be ready sooner for review. The test patch will just fix test cases. If that matters so much, I can create a final patch in the end which contains all the changes for easier commit?
    Refactoring Lucene collectors (HitCollector and extensions)
    -----------------------------------------------------------

    Key: LUCENE-1575
    URL: https://issues.apache.org/jira/browse/LUCENE-1575
    Project: Lucene - Java
    Issue Type: Improvement
    Components: Search
    Reporter: Shai Erera
    Fix For: 2.9


    This issue is a result of a recent discussion we've had on the mailing list. You can read the thread [here|http://www.nabble.com/Is-TopDocCollector%27s-collect()-implementation-correct--td22557419.html].
    We have agreed to do the following refactoring:
    * Rename MultiReaderHitCollector to Collector, with the purpose that it will be the base class for all Collector implementations.
    * Deprecate HitCollector in favor of the new Collector.
    * Introduce new methods in IndexSearcher that accept Collector, and deprecate those that accept HitCollector.
    ** Create a final class HitCollectorWrapper, and use it in the deprecated methods in IndexSearcher, wrapping the given HitCollector.
    ** HitCollectorWrapper will be marked deprecated, so we can remove it in 3.0, when we remove HitCollector.
    ** It will remove any instanceof checks that currently exist in IndexSearcher code.
    * Create a new (abstract) TopDocsCollector, which will:
    ** Leave collect and setNextReader unimplemented.
    ** Introduce protected members PriorityQueue and totalHits.
    ** Introduce a single protected constructor which accepts a PriorityQueue.
    ** Implement topDocs() and getTotalHits() using the PQ and totalHits members. These can be used as-are by extending classes, as well as be overridden.
    ** Introduce a new topDocs(start, howMany) method which will be used a convenience method when implementing a search application which allows paging through search results. It will also attempt to improve the memory allocation, by allocating a ScoreDoc[] of the requested size only.
    * Change TopScoreDocCollector to extend TopDocsCollector, use the topDocs() and getTotalHits() implementations as they are from TopDocsCollector. The class will also be made final.
    * Change TopFieldCollector to extend TopDocsCollector, and make the class final. Also implement topDocs(start, howMany).
    * Change TopFieldDocCollector (deprecated) to extend TopDocsCollector, instead of TopScoreDocCollector. Implement topDocs(start, howMany)
    * Review other places where HitCollector is used, such as in Scorer, deprecate those places and use Collector instead.
    Additionally, the following proposal was made w.r.t. decoupling score from collect():
    * Change collect to accecpt only a doc Id (unbased).
    * Introduce a setScorer(Scorer) method.
    * If during collect the implementation needs the score, it can call scorer.score().
    If we do this, then we need to review all places in the code where collect(doc, score) is called, and assert whether Scorer can be passed. Also this raises few questions:
    * What if during collect() Scorer is null? (i.e., not set) - is it even possible?
    * I noticed that many (if not all) of the collect() implementations discard the document if its score is not greater than 0. Doesn't it mean that score is needed in collect() always?
    Open issues:
    * The name for Collector
    * TopDocsCollector was mentioned on the thread as TopResultsCollector, but that was when we thought to call Colletor ResultsColletor. Since we decided (so far) on Collector, I think TopDocsCollector makes sense, because of its TopDocs output.
    * Decoupling score from collect().
    I will post a patch a bit later, as this is expected to be a very large patch. I will split it into 2: (1) code patch (2) test cases (moving to use Collector instead of HitCollector, as well as testing the new topDocs(start, howMany) method.
    There might be even a 3rd patch which handles the setScorer thing in Collector (maybe even a different issue?)
    --
    This message is automatically generated by JIRA.
    -
    You can reply to this email to add a comment to the issue online.


    ---------------------------------------------------------------------
    To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
    For additional commands, e-mail: java-dev-help@lucene.apache.org
  • Marvin Humphrey (JIRA) at Mar 28, 2009 at 2:53 pm
    [ https://issues.apache.org/jira/browse/LUCENE-1575?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12693448#action_12693448 ]

    Marvin Humphrey commented on LUCENE-1575:
    -----------------------------------------
    BTW, how can a document get a zero score?
    Any number of ways, since Query and Scorer are extensible. How about a RandomScoreQuery that uses floor(rand(1.9))? Or say that you have a bitset of docs which should match and you use that to feed a scorer. What score should you assign? Why not 0? Why not -1? Should it matter?
    Refactoring Lucene collectors (HitCollector and extensions)
    -----------------------------------------------------------

    Key: LUCENE-1575
    URL: https://issues.apache.org/jira/browse/LUCENE-1575
    Project: Lucene - Java
    Issue Type: Improvement
    Components: Search
    Reporter: Shai Erera
    Fix For: 2.9


    This issue is a result of a recent discussion we've had on the mailing list. You can read the thread [here|http://www.nabble.com/Is-TopDocCollector%27s-collect()-implementation-correct--td22557419.html].
    We have agreed to do the following refactoring:
    * Rename MultiReaderHitCollector to Collector, with the purpose that it will be the base class for all Collector implementations.
    * Deprecate HitCollector in favor of the new Collector.
    * Introduce new methods in IndexSearcher that accept Collector, and deprecate those that accept HitCollector.
    ** Create a final class HitCollectorWrapper, and use it in the deprecated methods in IndexSearcher, wrapping the given HitCollector.
    ** HitCollectorWrapper will be marked deprecated, so we can remove it in 3.0, when we remove HitCollector.
    ** It will remove any instanceof checks that currently exist in IndexSearcher code.
    * Create a new (abstract) TopDocsCollector, which will:
    ** Leave collect and setNextReader unimplemented.
    ** Introduce protected members PriorityQueue and totalHits.
    ** Introduce a single protected constructor which accepts a PriorityQueue.
    ** Implement topDocs() and getTotalHits() using the PQ and totalHits members. These can be used as-are by extending classes, as well as be overridden.
    ** Introduce a new topDocs(start, howMany) method which will be used a convenience method when implementing a search application which allows paging through search results. It will also attempt to improve the memory allocation, by allocating a ScoreDoc[] of the requested size only.
    * Change TopScoreDocCollector to extend TopDocsCollector, use the topDocs() and getTotalHits() implementations as they are from TopDocsCollector. The class will also be made final.
    * Change TopFieldCollector to extend TopDocsCollector, and make the class final. Also implement topDocs(start, howMany).
    * Change TopFieldDocCollector (deprecated) to extend TopDocsCollector, instead of TopScoreDocCollector. Implement topDocs(start, howMany)
    * Review other places where HitCollector is used, such as in Scorer, deprecate those places and use Collector instead.
    Additionally, the following proposal was made w.r.t. decoupling score from collect():
    * Change collect to accecpt only a doc Id (unbased).
    * Introduce a setScorer(Scorer) method.
    * If during collect the implementation needs the score, it can call scorer.score().
    If we do this, then we need to review all places in the code where collect(doc, score) is called, and assert whether Scorer can be passed. Also this raises few questions:
    * What if during collect() Scorer is null? (i.e., not set) - is it even possible?
    * I noticed that many (if not all) of the collect() implementations discard the document if its score is not greater than 0. Doesn't it mean that score is needed in collect() always?
    Open issues:
    * The name for Collector
    * TopDocsCollector was mentioned on the thread as TopResultsCollector, but that was when we thought to call Colletor ResultsColletor. Since we decided (so far) on Collector, I think TopDocsCollector makes sense, because of its TopDocs output.
    * Decoupling score from collect().
    I will post a patch a bit later, as this is expected to be a very large patch. I will split it into 2: (1) code patch (2) test cases (moving to use Collector instead of HitCollector, as well as testing the new topDocs(start, howMany) method.
    There might be even a 3rd patch which handles the setScorer thing in Collector (maybe even a different issue?)
    --
    This message is automatically generated by JIRA.
    -
    You can reply to this email to add a comment to the issue online.


    ---------------------------------------------------------------------
    To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
    For additional commands, e-mail: java-dev-help@lucene.apache.org
  • Michael McCandless (JIRA) at Mar 28, 2009 at 4:01 pm
    [ https://issues.apache.org/jira/browse/LUCENE-1575?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12693462#action_12693462 ]

    Michael McCandless commented on LUCENE-1575:
    --------------------------------------------

    bq. I thought to split patches to code and test since I believe the code patch can be ready sooner for review. The test patch will just fix test cases. If that matters so much, I can create a final patch in the end which contains all the changes for easier commit?

    OK that sounds great. The back-compat tests will also assert nothing broke.

    bq. Anyway, it will be marked deprecated, and I have the intention to make it as 'invisible' as possible.

    OK.

    bq. BTW, how can a document get a zero score?

    I've wondered the same thing. There was this thread recently:

    http://www.nabble.com/TopDocCollector-td22244245.html
    Refactoring Lucene collectors (HitCollector and extensions)
    -----------------------------------------------------------

    Key: LUCENE-1575
    URL: https://issues.apache.org/jira/browse/LUCENE-1575
    Project: Lucene - Java
    Issue Type: Improvement
    Components: Search
    Reporter: Shai Erera
    Fix For: 2.9


    This issue is a result of a recent discussion we've had on the mailing list. You can read the thread [here|http://www.nabble.com/Is-TopDocCollector%27s-collect()-implementation-correct--td22557419.html].
    We have agreed to do the following refactoring:
    * Rename MultiReaderHitCollector to Collector, with the purpose that it will be the base class for all Collector implementations.
    * Deprecate HitCollector in favor of the new Collector.
    * Introduce new methods in IndexSearcher that accept Collector, and deprecate those that accept HitCollector.
    ** Create a final class HitCollectorWrapper, and use it in the deprecated methods in IndexSearcher, wrapping the given HitCollector.
    ** HitCollectorWrapper will be marked deprecated, so we can remove it in 3.0, when we remove HitCollector.
    ** It will remove any instanceof checks that currently exist in IndexSearcher code.
    * Create a new (abstract) TopDocsCollector, which will:
    ** Leave collect and setNextReader unimplemented.
    ** Introduce protected members PriorityQueue and totalHits.
    ** Introduce a single protected constructor which accepts a PriorityQueue.
    ** Implement topDocs() and getTotalHits() using the PQ and totalHits members. These can be used as-are by extending classes, as well as be overridden.
    ** Introduce a new topDocs(start, howMany) method which will be used a convenience method when implementing a search application which allows paging through search results. It will also attempt to improve the memory allocation, by allocating a ScoreDoc[] of the requested size only.
    * Change TopScoreDocCollector to extend TopDocsCollector, use the topDocs() and getTotalHits() implementations as they are from TopDocsCollector. The class will also be made final.
    * Change TopFieldCollector to extend TopDocsCollector, and make the class final. Also implement topDocs(start, howMany).
    * Change TopFieldDocCollector (deprecated) to extend TopDocsCollector, instead of TopScoreDocCollector. Implement topDocs(start, howMany)
    * Review other places where HitCollector is used, such as in Scorer, deprecate those places and use Collector instead.
    Additionally, the following proposal was made w.r.t. decoupling score from collect():
    * Change collect to accecpt only a doc Id (unbased).
    * Introduce a setScorer(Scorer) method.
    * If during collect the implementation needs the score, it can call scorer.score().
    If we do this, then we need to review all places in the code where collect(doc, score) is called, and assert whether Scorer can be passed. Also this raises few questions:
    * What if during collect() Scorer is null? (i.e., not set) - is it even possible?
    * I noticed that many (if not all) of the collect() implementations discard the document if its score is not greater than 0. Doesn't it mean that score is needed in collect() always?
    Open issues:
    * The name for Collector
    * TopDocsCollector was mentioned on the thread as TopResultsCollector, but that was when we thought to call Colletor ResultsColletor. Since we decided (so far) on Collector, I think TopDocsCollector makes sense, because of its TopDocs output.
    * Decoupling score from collect().
    I will post a patch a bit later, as this is expected to be a very large patch. I will split it into 2: (1) code patch (2) test cases (moving to use Collector instead of HitCollector, as well as testing the new topDocs(start, howMany) method.
    There might be even a 3rd patch which handles the setScorer thing in Collector (maybe even a different issue?)
    --
    This message is automatically generated by JIRA.
    -
    You can reply to this email to add a comment to the issue online.


    ---------------------------------------------------------------------
    To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
    For additional commands, e-mail: java-dev-help@lucene.apache.org
  • Shai Erera (JIRA) at Mar 28, 2009 at 4:31 pm
    [ https://issues.apache.org/jira/browse/LUCENE-1575?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12693469#action_12693469 ]

    Shai Erera commented on LUCENE-1575:
    ------------------------------------

    After I posted the question on how can a document get a 0 score, I realized that it's possible due to extensions of Similarity for example. Thanks Marvin for clearing that up. I guess though that the Lucene core classes will not assign <= 0 score to a document?

    Anyway, whether it's true or not, I think I agree with Mike saying we should remove this screening from the core collectors. If my application extends Lucene in a way that it can assign <= 0 scores to documents, and it has the intention of screening those documents, it should use the new FilterZeroScoresCollector (maybe call it OnlyPositiveScoresCollector?)

    I don't think that assigning <= 0 score to a document necessarily means it should be removed from the result set.

    However, Mike (and others) - isn't there a back-compatibility issue with changing the core collectors to not screen on <=0 score documents? I mean, what if my application relies on that and extended Lucene in a way that it sometimes assigns 0 scores to documents? Now when I'll switch to 2.9, those documents won't be filtered. I will be able to use the new FilterZeroScoresCollector, but that'll require me to change my app's code.

    Maybe just do it for the new collectors (TopScoreDocCollector and TopFieldCollector)? I need to change my app's code anyway if I want to use them, so as long as we document this fact in their javadocs, we should be fine?
    Refactoring Lucene collectors (HitCollector and extensions)
    -----------------------------------------------------------

    Key: LUCENE-1575
    URL: https://issues.apache.org/jira/browse/LUCENE-1575
    Project: Lucene - Java
    Issue Type: Improvement
    Components: Search
    Reporter: Shai Erera
    Fix For: 2.9


    This issue is a result of a recent discussion we've had on the mailing list. You can read the thread [here|http://www.nabble.com/Is-TopDocCollector%27s-collect()-implementation-correct--td22557419.html].
    We have agreed to do the following refactoring:
    * Rename MultiReaderHitCollector to Collector, with the purpose that it will be the base class for all Collector implementations.
    * Deprecate HitCollector in favor of the new Collector.
    * Introduce new methods in IndexSearcher that accept Collector, and deprecate those that accept HitCollector.
    ** Create a final class HitCollectorWrapper, and use it in the deprecated methods in IndexSearcher, wrapping the given HitCollector.
    ** HitCollectorWrapper will be marked deprecated, so we can remove it in 3.0, when we remove HitCollector.
    ** It will remove any instanceof checks that currently exist in IndexSearcher code.
    * Create a new (abstract) TopDocsCollector, which will:
    ** Leave collect and setNextReader unimplemented.
    ** Introduce protected members PriorityQueue and totalHits.
    ** Introduce a single protected constructor which accepts a PriorityQueue.
    ** Implement topDocs() and getTotalHits() using the PQ and totalHits members. These can be used as-are by extending classes, as well as be overridden.
    ** Introduce a new topDocs(start, howMany) method which will be used a convenience method when implementing a search application which allows paging through search results. It will also attempt to improve the memory allocation, by allocating a ScoreDoc[] of the requested size only.
    * Change TopScoreDocCollector to extend TopDocsCollector, use the topDocs() and getTotalHits() implementations as they are from TopDocsCollector. The class will also be made final.
    * Change TopFieldCollector to extend TopDocsCollector, and make the class final. Also implement topDocs(start, howMany).
    * Change TopFieldDocCollector (deprecated) to extend TopDocsCollector, instead of TopScoreDocCollector. Implement topDocs(start, howMany)
    * Review other places where HitCollector is used, such as in Scorer, deprecate those places and use Collector instead.
    Additionally, the following proposal was made w.r.t. decoupling score from collect():
    * Change collect to accecpt only a doc Id (unbased).
    * Introduce a setScorer(Scorer) method.
    * If during collect the implementation needs the score, it can call scorer.score().
    If we do this, then we need to review all places in the code where collect(doc, score) is called, and assert whether Scorer can be passed. Also this raises few questions:
    * What if during collect() Scorer is null? (i.e., not set) - is it even possible?
    * I noticed that many (if not all) of the collect() implementations discard the document if its score is not greater than 0. Doesn't it mean that score is needed in collect() always?
    Open issues:
    * The name for Collector
    * TopDocsCollector was mentioned on the thread as TopResultsCollector, but that was when we thought to call Colletor ResultsColletor. Since we decided (so far) on Collector, I think TopDocsCollector makes sense, because of its TopDocs output.
    * Decoupling score from collect().
    I will post a patch a bit later, as this is expected to be a very large patch. I will split it into 2: (1) code patch (2) test cases (moving to use Collector instead of HitCollector, as well as testing the new topDocs(start, howMany) method.
    There might be even a 3rd patch which handles the setScorer thing in Collector (maybe even a different issue?)
    --
    This message is automatically generated by JIRA.
    -
    You can reply to this email to add a comment to the issue online.


    ---------------------------------------------------------------------
    To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
    For additional commands, e-mail: java-dev-help@lucene.apache.org
  • Michael McCandless (JIRA) at Mar 28, 2009 at 5:09 pm
    [ https://issues.apache.org/jira/browse/LUCENE-1575?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12693470#action_12693470 ]

    Michael McCandless commented on LUCENE-1575:
    --------------------------------------------


    bq. However, Mike (and others) - isn't there a back-compatibility issue with changing the core collectors to not screen on <=0 score documents?

    Hmm right there is, because the search methods will use the new collectors.

    bq. I need to change my app's code anyway if I want to use them, so as long as we document this fact in their javadocs, we should be fine?

    Actually there's no change to your code required (the search methods should use the new collectors). So we do have a back-compat difference.

    We could make the change (turn off filtering), but put a setter on IndexSearcher to have it insert the "PositiveScoresOnlyCollector" wrapper? I think the vast majority of users are not relying on <= 0 scoring docs to be filtered out.
    Refactoring Lucene collectors (HitCollector and extensions)
    -----------------------------------------------------------

    Key: LUCENE-1575
    URL: https://issues.apache.org/jira/browse/LUCENE-1575
    Project: Lucene - Java
    Issue Type: Improvement
    Components: Search
    Reporter: Shai Erera
    Fix For: 2.9


    This issue is a result of a recent discussion we've had on the mailing list. You can read the thread [here|http://www.nabble.com/Is-TopDocCollector%27s-collect()-implementation-correct--td22557419.html].
    We have agreed to do the following refactoring:
    * Rename MultiReaderHitCollector to Collector, with the purpose that it will be the base class for all Collector implementations.
    * Deprecate HitCollector in favor of the new Collector.
    * Introduce new methods in IndexSearcher that accept Collector, and deprecate those that accept HitCollector.
    ** Create a final class HitCollectorWrapper, and use it in the deprecated methods in IndexSearcher, wrapping the given HitCollector.
    ** HitCollectorWrapper will be marked deprecated, so we can remove it in 3.0, when we remove HitCollector.
    ** It will remove any instanceof checks that currently exist in IndexSearcher code.
    * Create a new (abstract) TopDocsCollector, which will:
    ** Leave collect and setNextReader unimplemented.
    ** Introduce protected members PriorityQueue and totalHits.
    ** Introduce a single protected constructor which accepts a PriorityQueue.
    ** Implement topDocs() and getTotalHits() using the PQ and totalHits members. These can be used as-are by extending classes, as well as be overridden.
    ** Introduce a new topDocs(start, howMany) method which will be used a convenience method when implementing a search application which allows paging through search results. It will also attempt to improve the memory allocation, by allocating a ScoreDoc[] of the requested size only.
    * Change TopScoreDocCollector to extend TopDocsCollector, use the topDocs() and getTotalHits() implementations as they are from TopDocsCollector. The class will also be made final.
    * Change TopFieldCollector to extend TopDocsCollector, and make the class final. Also implement topDocs(start, howMany).
    * Change TopFieldDocCollector (deprecated) to extend TopDocsCollector, instead of TopScoreDocCollector. Implement topDocs(start, howMany)
    * Review other places where HitCollector is used, such as in Scorer, deprecate those places and use Collector instead.
    Additionally, the following proposal was made w.r.t. decoupling score from collect():
    * Change collect to accecpt only a doc Id (unbased).
    * Introduce a setScorer(Scorer) method.
    * If during collect the implementation needs the score, it can call scorer.score().
    If we do this, then we need to review all places in the code where collect(doc, score) is called, and assert whether Scorer can be passed. Also this raises few questions:
    * What if during collect() Scorer is null? (i.e., not set) - is it even possible?
    * I noticed that many (if not all) of the collect() implementations discard the document if its score is not greater than 0. Doesn't it mean that score is needed in collect() always?
    Open issues:
    * The name for Collector
    * TopDocsCollector was mentioned on the thread as TopResultsCollector, but that was when we thought to call Colletor ResultsColletor. Since we decided (so far) on Collector, I think TopDocsCollector makes sense, because of its TopDocs output.
    * Decoupling score from collect().
    I will post a patch a bit later, as this is expected to be a very large patch. I will split it into 2: (1) code patch (2) test cases (moving to use Collector instead of HitCollector, as well as testing the new topDocs(start, howMany) method.
    There might be even a 3rd patch which handles the setScorer thing in Collector (maybe even a different issue?)
    --
    This message is automatically generated by JIRA.
    -
    You can reply to this email to add a comment to the issue online.


    ---------------------------------------------------------------------
    To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
    For additional commands, e-mail: java-dev-help@lucene.apache.org
  • Shai Erera (JIRA) at Mar 28, 2009 at 6:23 pm
    [ https://issues.apache.org/jira/browse/LUCENE-1575?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12693474#action_12693474 ]

    Shai Erera commented on LUCENE-1575:
    ------------------------------------

    bq. We could make the change (turn off filtering), but put a setter on IndexSearcher to have it insert the "PositiveScoresOnlyCollector" wrapper?

    Then why do that at all? If I need to call searcher.setKeepOnlyPositiveScores, then it means a change to my code. I could then just pass in the PositiveScoresOnlyCollector to the search methods instead, right?

    I guess you are referring to the methods which don't take a collector as a parameter and instantiate a new TopScoreDocCollector internally? I tend to think that if someone uses those, it is just because they are simple, and I find it very hard to imagine that that someone relies on the filtering. So perhaps we can get away with just documenting the change in behavior?

    bq. I think the vast majority of users are not relying on <= 0 scoring docs to be filtered out.

    I tend to agree. This has been around for quite some time. I checked my custom collectors, and they do the same check. I only now realize I just followed the code practice I saw in Lucene's code, never giving it much thought of whether this can actually happen. I believe that if I'd have extended Lucene in a way such that it returns <=0 scores, I'd be aware of that and probably won't use the built-in collectors. I see no reason to filter <= 0 scored docs anyway, and if I wanted that, I'd probably write my own filtering collector ...

    I think that if we don't believe people rely on the <= 0 filtering, let's just document it. I'd hate to add a setter method to IndexSearcher, and a unit test, and check where else it should be added (i.e., in extending searcher classes) and introduce a new API which we might need to deprecate some day ...
    People who'll need that functionality can move to use the methods that accept a Collector, and pass in the PositiveScoresOnlyCollector. That way we also keep the 'fast and easy' search methods really simple, fast and easy.

    Is that acceptable?
    Refactoring Lucene collectors (HitCollector and extensions)
    -----------------------------------------------------------

    Key: LUCENE-1575
    URL: https://issues.apache.org/jira/browse/LUCENE-1575
    Project: Lucene - Java
    Issue Type: Improvement
    Components: Search
    Reporter: Shai Erera
    Fix For: 2.9


    This issue is a result of a recent discussion we've had on the mailing list. You can read the thread [here|http://www.nabble.com/Is-TopDocCollector%27s-collect()-implementation-correct--td22557419.html].
    We have agreed to do the following refactoring:
    * Rename MultiReaderHitCollector to Collector, with the purpose that it will be the base class for all Collector implementations.
    * Deprecate HitCollector in favor of the new Collector.
    * Introduce new methods in IndexSearcher that accept Collector, and deprecate those that accept HitCollector.
    ** Create a final class HitCollectorWrapper, and use it in the deprecated methods in IndexSearcher, wrapping the given HitCollector.
    ** HitCollectorWrapper will be marked deprecated, so we can remove it in 3.0, when we remove HitCollector.
    ** It will remove any instanceof checks that currently exist in IndexSearcher code.
    * Create a new (abstract) TopDocsCollector, which will:
    ** Leave collect and setNextReader unimplemented.
    ** Introduce protected members PriorityQueue and totalHits.
    ** Introduce a single protected constructor which accepts a PriorityQueue.
    ** Implement topDocs() and getTotalHits() using the PQ and totalHits members. These can be used as-are by extending classes, as well as be overridden.
    ** Introduce a new topDocs(start, howMany) method which will be used a convenience method when implementing a search application which allows paging through search results. It will also attempt to improve the memory allocation, by allocating a ScoreDoc[] of the requested size only.
    * Change TopScoreDocCollector to extend TopDocsCollector, use the topDocs() and getTotalHits() implementations as they are from TopDocsCollector. The class will also be made final.
    * Change TopFieldCollector to extend TopDocsCollector, and make the class final. Also implement topDocs(start, howMany).
    * Change TopFieldDocCollector (deprecated) to extend TopDocsCollector, instead of TopScoreDocCollector. Implement topDocs(start, howMany)
    * Review other places where HitCollector is used, such as in Scorer, deprecate those places and use Collector instead.
    Additionally, the following proposal was made w.r.t. decoupling score from collect():
    * Change collect to accecpt only a doc Id (unbased).
    * Introduce a setScorer(Scorer) method.
    * If during collect the implementation needs the score, it can call scorer.score().
    If we do this, then we need to review all places in the code where collect(doc, score) is called, and assert whether Scorer can be passed. Also this raises few questions:
    * What if during collect() Scorer is null? (i.e., not set) - is it even possible?
    * I noticed that many (if not all) of the collect() implementations discard the document if its score is not greater than 0. Doesn't it mean that score is needed in collect() always?
    Open issues:
    * The name for Collector
    * TopDocsCollector was mentioned on the thread as TopResultsCollector, but that was when we thought to call Colletor ResultsColletor. Since we decided (so far) on Collector, I think TopDocsCollector makes sense, because of its TopDocs output.
    * Decoupling score from collect().
    I will post a patch a bit later, as this is expected to be a very large patch. I will split it into 2: (1) code patch (2) test cases (moving to use Collector instead of HitCollector, as well as testing the new topDocs(start, howMany) method.
    There might be even a 3rd patch which handles the setScorer thing in Collector (maybe even a different issue?)
    --
    This message is automatically generated by JIRA.
    -
    You can reply to this email to add a comment to the issue online.


    ---------------------------------------------------------------------
    To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
    For additional commands, e-mail: java-dev-help@lucene.apache.org
  • Michael McCandless (JIRA) at Mar 28, 2009 at 6:51 pm
    [ https://issues.apache.org/jira/browse/LUCENE-1575?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12693478#action_12693478 ]

    Michael McCandless commented on LUCENE-1575:
    --------------------------------------------

    bq. Then why do that at all? If I need to call searcher.setKeepOnlyPositiveScores, then it means a change to my code. I could then just pass in the PositiveScoresOnlyCollector to the search methods instead, right?

    OK, I agree. Let's add an entry to the top of CHANGES.txt that states this [minor] break in back compatibility, as well as the code fragment showing how to use that filter to get back to the pre-2.9 way?
    Refactoring Lucene collectors (HitCollector and extensions)
    -----------------------------------------------------------

    Key: LUCENE-1575
    URL: https://issues.apache.org/jira/browse/LUCENE-1575
    Project: Lucene - Java
    Issue Type: Improvement
    Components: Search
    Reporter: Shai Erera
    Fix For: 2.9


    This issue is a result of a recent discussion we've had on the mailing list. You can read the thread [here|http://www.nabble.com/Is-TopDocCollector%27s-collect()-implementation-correct--td22557419.html].
    We have agreed to do the following refactoring:
    * Rename MultiReaderHitCollector to Collector, with the purpose that it will be the base class for all Collector implementations.
    * Deprecate HitCollector in favor of the new Collector.
    * Introduce new methods in IndexSearcher that accept Collector, and deprecate those that accept HitCollector.
    ** Create a final class HitCollectorWrapper, and use it in the deprecated methods in IndexSearcher, wrapping the given HitCollector.
    ** HitCollectorWrapper will be marked deprecated, so we can remove it in 3.0, when we remove HitCollector.
    ** It will remove any instanceof checks that currently exist in IndexSearcher code.
    * Create a new (abstract) TopDocsCollector, which will:
    ** Leave collect and setNextReader unimplemented.
    ** Introduce protected members PriorityQueue and totalHits.
    ** Introduce a single protected constructor which accepts a PriorityQueue.
    ** Implement topDocs() and getTotalHits() using the PQ and totalHits members. These can be used as-are by extending classes, as well as be overridden.
    ** Introduce a new topDocs(start, howMany) method which will be used a convenience method when implementing a search application which allows paging through search results. It will also attempt to improve the memory allocation, by allocating a ScoreDoc[] of the requested size only.
    * Change TopScoreDocCollector to extend TopDocsCollector, use the topDocs() and getTotalHits() implementations as they are from TopDocsCollector. The class will also be made final.
    * Change TopFieldCollector to extend TopDocsCollector, and make the class final. Also implement topDocs(start, howMany).
    * Change TopFieldDocCollector (deprecated) to extend TopDocsCollector, instead of TopScoreDocCollector. Implement topDocs(start, howMany)
    * Review other places where HitCollector is used, such as in Scorer, deprecate those places and use Collector instead.
    Additionally, the following proposal was made w.r.t. decoupling score from collect():
    * Change collect to accecpt only a doc Id (unbased).
    * Introduce a setScorer(Scorer) method.
    * If during collect the implementation needs the score, it can call scorer.score().
    If we do this, then we need to review all places in the code where collect(doc, score) is called, and assert whether Scorer can be passed. Also this raises few questions:
    * What if during collect() Scorer is null? (i.e., not set) - is it even possible?
    * I noticed that many (if not all) of the collect() implementations discard the document if its score is not greater than 0. Doesn't it mean that score is needed in collect() always?
    Open issues:
    * The name for Collector
    * TopDocsCollector was mentioned on the thread as TopResultsCollector, but that was when we thought to call Colletor ResultsColletor. Since we decided (so far) on Collector, I think TopDocsCollector makes sense, because of its TopDocs output.
    * Decoupling score from collect().
    I will post a patch a bit later, as this is expected to be a very large patch. I will split it into 2: (1) code patch (2) test cases (moving to use Collector instead of HitCollector, as well as testing the new topDocs(start, howMany) method.
    There might be even a 3rd patch which handles the setScorer thing in Collector (maybe even a different issue?)
    --
    This message is automatically generated by JIRA.
    -
    You can reply to this email to add a comment to the issue online.


    ---------------------------------------------------------------------
    To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
    For additional commands, e-mail: java-dev-help@lucene.apache.org
  • Shai Erera (JIRA) at Mar 29, 2009 at 3:57 am
    [ https://issues.apache.org/jira/browse/LUCENE-1575?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12693513#action_12693513 ]

    Shai Erera commented on LUCENE-1575:
    ------------------------------------

    Great !
    Refactoring Lucene collectors (HitCollector and extensions)
    -----------------------------------------------------------

    Key: LUCENE-1575
    URL: https://issues.apache.org/jira/browse/LUCENE-1575
    Project: Lucene - Java
    Issue Type: Improvement
    Components: Search
    Reporter: Shai Erera
    Fix For: 2.9


    This issue is a result of a recent discussion we've had on the mailing list. You can read the thread [here|http://www.nabble.com/Is-TopDocCollector%27s-collect()-implementation-correct--td22557419.html].
    We have agreed to do the following refactoring:
    * Rename MultiReaderHitCollector to Collector, with the purpose that it will be the base class for all Collector implementations.
    * Deprecate HitCollector in favor of the new Collector.
    * Introduce new methods in IndexSearcher that accept Collector, and deprecate those that accept HitCollector.
    ** Create a final class HitCollectorWrapper, and use it in the deprecated methods in IndexSearcher, wrapping the given HitCollector.
    ** HitCollectorWrapper will be marked deprecated, so we can remove it in 3.0, when we remove HitCollector.
    ** It will remove any instanceof checks that currently exist in IndexSearcher code.
    * Create a new (abstract) TopDocsCollector, which will:
    ** Leave collect and setNextReader unimplemented.
    ** Introduce protected members PriorityQueue and totalHits.
    ** Introduce a single protected constructor which accepts a PriorityQueue.
    ** Implement topDocs() and getTotalHits() using the PQ and totalHits members. These can be used as-are by extending classes, as well as be overridden.
    ** Introduce a new topDocs(start, howMany) method which will be used a convenience method when implementing a search application which allows paging through search results. It will also attempt to improve the memory allocation, by allocating a ScoreDoc[] of the requested size only.
    * Change TopScoreDocCollector to extend TopDocsCollector, use the topDocs() and getTotalHits() implementations as they are from TopDocsCollector. The class will also be made final.
    * Change TopFieldCollector to extend TopDocsCollector, and make the class final. Also implement topDocs(start, howMany).
    * Change TopFieldDocCollector (deprecated) to extend TopDocsCollector, instead of TopScoreDocCollector. Implement topDocs(start, howMany)
    * Review other places where HitCollector is used, such as in Scorer, deprecate those places and use Collector instead.
    Additionally, the following proposal was made w.r.t. decoupling score from collect():
    * Change collect to accecpt only a doc Id (unbased).
    * Introduce a setScorer(Scorer) method.
    * If during collect the implementation needs the score, it can call scorer.score().
    If we do this, then we need to review all places in the code where collect(doc, score) is called, and assert whether Scorer can be passed. Also this raises few questions:
    * What if during collect() Scorer is null? (i.e., not set) - is it even possible?
    * I noticed that many (if not all) of the collect() implementations discard the document if its score is not greater than 0. Doesn't it mean that score is needed in collect() always?
    Open issues:
    * The name for Collector
    * TopDocsCollector was mentioned on the thread as TopResultsCollector, but that was when we thought to call Colletor ResultsColletor. Since we decided (so far) on Collector, I think TopDocsCollector makes sense, because of its TopDocs output.
    * Decoupling score from collect().
    I will post a patch a bit later, as this is expected to be a very large patch. I will split it into 2: (1) code patch (2) test cases (moving to use Collector instead of HitCollector, as well as testing the new topDocs(start, howMany) method.
    There might be even a 3rd patch which handles the setScorer thing in Collector (maybe even a different issue?)
    --
    This message is automatically generated by JIRA.
    -
    You can reply to this email to add a comment to the issue online.


    ---------------------------------------------------------------------
    To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
    For additional commands, e-mail: java-dev-help@lucene.apache.org
  • Shai Erera (JIRA) at Mar 29, 2009 at 6:17 am
    [ https://issues.apache.org/jira/browse/LUCENE-1575?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12693524#action_12693524 ]

    Shai Erera commented on LUCENE-1575:
    ------------------------------------

    BooleanScorer defines an internal package private static final Collector class. Two questions:
    # May I change it to BooleanCollector? (the name conflicts with the Collector name we want to give to all base collectors)
    # May I change it to private static final? It is used only in BooleanScorer's newCollector() method.
    I think the two are safe because it's already package-private and there's no other Lucene code which uses it.

    BTW, we might wanna review BooleanScorer's internal classes visibility. They are all package-private, with some public methods, however used by BooleanScorer only ... But that's something for a different issue.
    Refactoring Lucene collectors (HitCollector and extensions)
    -----------------------------------------------------------

    Key: LUCENE-1575
    URL: https://issues.apache.org/jira/browse/LUCENE-1575
    Project: Lucene - Java
    Issue Type: Improvement
    Components: Search
    Reporter: Shai Erera
    Fix For: 2.9


    This issue is a result of a recent discussion we've had on the mailing list. You can read the thread [here|http://www.nabble.com/Is-TopDocCollector%27s-collect()-implementation-correct--td22557419.html].
    We have agreed to do the following refactoring:
    * Rename MultiReaderHitCollector to Collector, with the purpose that it will be the base class for all Collector implementations.
    * Deprecate HitCollector in favor of the new Collector.
    * Introduce new methods in IndexSearcher that accept Collector, and deprecate those that accept HitCollector.
    ** Create a final class HitCollectorWrapper, and use it in the deprecated methods in IndexSearcher, wrapping the given HitCollector.
    ** HitCollectorWrapper will be marked deprecated, so we can remove it in 3.0, when we remove HitCollector.
    ** It will remove any instanceof checks that currently exist in IndexSearcher code.
    * Create a new (abstract) TopDocsCollector, which will:
    ** Leave collect and setNextReader unimplemented.
    ** Introduce protected members PriorityQueue and totalHits.
    ** Introduce a single protected constructor which accepts a PriorityQueue.
    ** Implement topDocs() and getTotalHits() using the PQ and totalHits members. These can be used as-are by extending classes, as well as be overridden.
    ** Introduce a new topDocs(start, howMany) method which will be used a convenience method when implementing a search application which allows paging through search results. It will also attempt to improve the memory allocation, by allocating a ScoreDoc[] of the requested size only.
    * Change TopScoreDocCollector to extend TopDocsCollector, use the topDocs() and getTotalHits() implementations as they are from TopDocsCollector. The class will also be made final.
    * Change TopFieldCollector to extend TopDocsCollector, and make the class final. Also implement topDocs(start, howMany).
    * Change TopFieldDocCollector (deprecated) to extend TopDocsCollector, instead of TopScoreDocCollector. Implement topDocs(start, howMany)
    * Review other places where HitCollector is used, such as in Scorer, deprecate those places and use Collector instead.
    Additionally, the following proposal was made w.r.t. decoupling score from collect():
    * Change collect to accecpt only a doc Id (unbased).
    * Introduce a setScorer(Scorer) method.
    * If during collect the implementation needs the score, it can call scorer.score().
    If we do this, then we need to review all places in the code where collect(doc, score) is called, and assert whether Scorer can be passed. Also this raises few questions:
    * What if during collect() Scorer is null? (i.e., not set) - is it even possible?
    * I noticed that many (if not all) of the collect() implementations discard the document if its score is not greater than 0. Doesn't it mean that score is needed in collect() always?
    Open issues:
    * The name for Collector
    * TopDocsCollector was mentioned on the thread as TopResultsCollector, but that was when we thought to call Colletor ResultsColletor. Since we decided (so far) on Collector, I think TopDocsCollector makes sense, because of its TopDocs output.
    * Decoupling score from collect().
    I will post a patch a bit later, as this is expected to be a very large patch. I will split it into 2: (1) code patch (2) test cases (moving to use Collector instead of HitCollector, as well as testing the new topDocs(start, howMany) method.
    There might be even a 3rd patch which handles the setScorer thing in Collector (maybe even a different issue?)
    --
    This message is automatically generated by JIRA.
    -
    You can reply to this email to add a comment to the issue online.


    ---------------------------------------------------------------------
    To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
    For additional commands, e-mail: java-dev-help@lucene.apache.org
  • Michael McCandless (JIRA) at Mar 29, 2009 at 9:33 am
    [ https://issues.apache.org/jira/browse/LUCENE-1575?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12693533#action_12693533 ]

    Michael McCandless commented on LUCENE-1575:
    --------------------------------------------

    bq. May I change it to BooleanCollector? (the name conflicts with the Collector name we want to give to all base collectors)

    bq. May I change it to private static final? It is used only in BooleanScorer's newCollector() method.

    I think these are fine.
    Refactoring Lucene collectors (HitCollector and extensions)
    -----------------------------------------------------------

    Key: LUCENE-1575
    URL: https://issues.apache.org/jira/browse/LUCENE-1575
    Project: Lucene - Java
    Issue Type: Improvement
    Components: Search
    Reporter: Shai Erera
    Fix For: 2.9


    This issue is a result of a recent discussion we've had on the mailing list. You can read the thread [here|http://www.nabble.com/Is-TopDocCollector%27s-collect()-implementation-correct--td22557419.html].
    We have agreed to do the following refactoring:
    * Rename MultiReaderHitCollector to Collector, with the purpose that it will be the base class for all Collector implementations.
    * Deprecate HitCollector in favor of the new Collector.
    * Introduce new methods in IndexSearcher that accept Collector, and deprecate those that accept HitCollector.
    ** Create a final class HitCollectorWrapper, and use it in the deprecated methods in IndexSearcher, wrapping the given HitCollector.
    ** HitCollectorWrapper will be marked deprecated, so we can remove it in 3.0, when we remove HitCollector.
    ** It will remove any instanceof checks that currently exist in IndexSearcher code.
    * Create a new (abstract) TopDocsCollector, which will:
    ** Leave collect and setNextReader unimplemented.
    ** Introduce protected members PriorityQueue and totalHits.
    ** Introduce a single protected constructor which accepts a PriorityQueue.
    ** Implement topDocs() and getTotalHits() using the PQ and totalHits members. These can be used as-are by extending classes, as well as be overridden.
    ** Introduce a new topDocs(start, howMany) method which will be used a convenience method when implementing a search application which allows paging through search results. It will also attempt to improve the memory allocation, by allocating a ScoreDoc[] of the requested size only.
    * Change TopScoreDocCollector to extend TopDocsCollector, use the topDocs() and getTotalHits() implementations as they are from TopDocsCollector. The class will also be made final.
    * Change TopFieldCollector to extend TopDocsCollector, and make the class final. Also implement topDocs(start, howMany).
    * Change TopFieldDocCollector (deprecated) to extend TopDocsCollector, instead of TopScoreDocCollector. Implement topDocs(start, howMany)
    * Review other places where HitCollector is used, such as in Scorer, deprecate those places and use Collector instead.
    Additionally, the following proposal was made w.r.t. decoupling score from collect():
    * Change collect to accecpt only a doc Id (unbased).
    * Introduce a setScorer(Scorer) method.
    * If during collect the implementation needs the score, it can call scorer.score().
    If we do this, then we need to review all places in the code where collect(doc, score) is called, and assert whether Scorer can be passed. Also this raises few questions:
    * What if during collect() Scorer is null? (i.e., not set) - is it even possible?
    * I noticed that many (if not all) of the collect() implementations discard the document if its score is not greater than 0. Doesn't it mean that score is needed in collect() always?
    Open issues:
    * The name for Collector
    * TopDocsCollector was mentioned on the thread as TopResultsCollector, but that was when we thought to call Colletor ResultsColletor. Since we decided (so far) on Collector, I think TopDocsCollector makes sense, because of its TopDocs output.
    * Decoupling score from collect().
    I will post a patch a bit later, as this is expected to be a very large patch. I will split it into 2: (1) code patch (2) test cases (moving to use Collector instead of HitCollector, as well as testing the new topDocs(start, howMany) method.
    There might be even a 3rd patch which handles the setScorer thing in Collector (maybe even a different issue?)
    --
    This message is automatically generated by JIRA.
    -
    You can reply to this email to add a comment to the issue online.


    ---------------------------------------------------------------------
    To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
    For additional commands, e-mail: java-dev-help@lucene.apache.org
  • Michael McCandless (JIRA) at Mar 30, 2009 at 11:31 am
    [ https://issues.apache.org/jira/browse/LUCENE-1575?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12693729#action_12693729 ]

    Michael McCandless commented on LUCENE-1575:
    --------------------------------------------

    I think as part of this we should allow TopFieldCollector to NOT get the score of each hit? EG another boolean to the ctor?
    Refactoring Lucene collectors (HitCollector and extensions)
    -----------------------------------------------------------

    Key: LUCENE-1575
    URL: https://issues.apache.org/jira/browse/LUCENE-1575
    Project: Lucene - Java
    Issue Type: Improvement
    Components: Search
    Reporter: Shai Erera
    Fix For: 2.9


    This issue is a result of a recent discussion we've had on the mailing list. You can read the thread [here|http://www.nabble.com/Is-TopDocCollector%27s-collect()-implementation-correct--td22557419.html].
    We have agreed to do the following refactoring:
    * Rename MultiReaderHitCollector to Collector, with the purpose that it will be the base class for all Collector implementations.
    * Deprecate HitCollector in favor of the new Collector.
    * Introduce new methods in IndexSearcher that accept Collector, and deprecate those that accept HitCollector.
    ** Create a final class HitCollectorWrapper, and use it in the deprecated methods in IndexSearcher, wrapping the given HitCollector.
    ** HitCollectorWrapper will be marked deprecated, so we can remove it in 3.0, when we remove HitCollector.
    ** It will remove any instanceof checks that currently exist in IndexSearcher code.
    * Create a new (abstract) TopDocsCollector, which will:
    ** Leave collect and setNextReader unimplemented.
    ** Introduce protected members PriorityQueue and totalHits.
    ** Introduce a single protected constructor which accepts a PriorityQueue.
    ** Implement topDocs() and getTotalHits() using the PQ and totalHits members. These can be used as-are by extending classes, as well as be overridden.
    ** Introduce a new topDocs(start, howMany) method which will be used a convenience method when implementing a search application which allows paging through search results. It will also attempt to improve the memory allocation, by allocating a ScoreDoc[] of the requested size only.
    * Change TopScoreDocCollector to extend TopDocsCollector, use the topDocs() and getTotalHits() implementations as they are from TopDocsCollector. The class will also be made final.
    * Change TopFieldCollector to extend TopDocsCollector, and make the class final. Also implement topDocs(start, howMany).
    * Change TopFieldDocCollector (deprecated) to extend TopDocsCollector, instead of TopScoreDocCollector. Implement topDocs(start, howMany)
    * Review other places where HitCollector is used, such as in Scorer, deprecate those places and use Collector instead.
    Additionally, the following proposal was made w.r.t. decoupling score from collect():
    * Change collect to accecpt only a doc Id (unbased).
    * Introduce a setScorer(Scorer) method.
    * If during collect the implementation needs the score, it can call scorer.score().
    If we do this, then we need to review all places in the code where collect(doc, score) is called, and assert whether Scorer can be passed. Also this raises few questions:
    * What if during collect() Scorer is null? (i.e., not set) - is it even possible?
    * I noticed that many (if not all) of the collect() implementations discard the document if its score is not greater than 0. Doesn't it mean that score is needed in collect() always?
    Open issues:
    * The name for Collector
    * TopDocsCollector was mentioned on the thread as TopResultsCollector, but that was when we thought to call Colletor ResultsColletor. Since we decided (so far) on Collector, I think TopDocsCollector makes sense, because of its TopDocs output.
    * Decoupling score from collect().
    I will post a patch a bit later, as this is expected to be a very large patch. I will split it into 2: (1) code patch (2) test cases (moving to use Collector instead of HitCollector, as well as testing the new topDocs(start, howMany) method.
    There might be even a 3rd patch which handles the setScorer thing in Collector (maybe even a different issue?)
    --
    This message is automatically generated by JIRA.
    -
    You can reply to this email to add a comment to the issue online.


    ---------------------------------------------------------------------
    To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
    For additional commands, e-mail: java-dev-help@lucene.apache.org
  • Shai Erera (JIRA) at Mar 30, 2009 at 11:39 am
    [ https://issues.apache.org/jira/browse/LUCENE-1575?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12693732#action_12693732 ]

    Shai Erera commented on LUCENE-1575:
    ------------------------------------

    I am not sure what you mean - score is used all over the place in collect() as well as other methods. updateBottom for example takes a score, updates bottom.score and then calls adjustTop(). Do you mean that if ignoreScore is true (in ctor), then setScorer should not save the Scorer and not call scorer.score()? If so, what should I do with all the methods that accept score? Create another code path in TopFieldCollector which ignore the score?

    Also, what should the default value be? true (for ignoring scores)?
    Refactoring Lucene collectors (HitCollector and extensions)
    -----------------------------------------------------------

    Key: LUCENE-1575
    URL: https://issues.apache.org/jira/browse/LUCENE-1575
    Project: Lucene - Java
    Issue Type: Improvement
    Components: Search
    Reporter: Shai Erera
    Fix For: 2.9


    This issue is a result of a recent discussion we've had on the mailing list. You can read the thread [here|http://www.nabble.com/Is-TopDocCollector%27s-collect()-implementation-correct--td22557419.html].
    We have agreed to do the following refactoring:
    * Rename MultiReaderHitCollector to Collector, with the purpose that it will be the base class for all Collector implementations.
    * Deprecate HitCollector in favor of the new Collector.
    * Introduce new methods in IndexSearcher that accept Collector, and deprecate those that accept HitCollector.
    ** Create a final class HitCollectorWrapper, and use it in the deprecated methods in IndexSearcher, wrapping the given HitCollector.
    ** HitCollectorWrapper will be marked deprecated, so we can remove it in 3.0, when we remove HitCollector.
    ** It will remove any instanceof checks that currently exist in IndexSearcher code.
    * Create a new (abstract) TopDocsCollector, which will:
    ** Leave collect and setNextReader unimplemented.
    ** Introduce protected members PriorityQueue and totalHits.
    ** Introduce a single protected constructor which accepts a PriorityQueue.
    ** Implement topDocs() and getTotalHits() using the PQ and totalHits members. These can be used as-are by extending classes, as well as be overridden.
    ** Introduce a new topDocs(start, howMany) method which will be used a convenience method when implementing a search application which allows paging through search results. It will also attempt to improve the memory allocation, by allocating a ScoreDoc[] of the requested size only.
    * Change TopScoreDocCollector to extend TopDocsCollector, use the topDocs() and getTotalHits() implementations as they are from TopDocsCollector. The class will also be made final.
    * Change TopFieldCollector to extend TopDocsCollector, and make the class final. Also implement topDocs(start, howMany).
    * Change TopFieldDocCollector (deprecated) to extend TopDocsCollector, instead of TopScoreDocCollector. Implement topDocs(start, howMany)
    * Review other places where HitCollector is used, such as in Scorer, deprecate those places and use Collector instead.
    Additionally, the following proposal was made w.r.t. decoupling score from collect():
    * Change collect to accecpt only a doc Id (unbased).
    * Introduce a setScorer(Scorer) method.
    * If during collect the implementation needs the score, it can call scorer.score().
    If we do this, then we need to review all places in the code where collect(doc, score) is called, and assert whether Scorer can be passed. Also this raises few questions:
    * What if during collect() Scorer is null? (i.e., not set) - is it even possible?
    * I noticed that many (if not all) of the collect() implementations discard the document if its score is not greater than 0. Doesn't it mean that score is needed in collect() always?
    Open issues:
    * The name for Collector
    * TopDocsCollector was mentioned on the thread as TopResultsCollector, but that was when we thought to call Colletor ResultsColletor. Since we decided (so far) on Collector, I think TopDocsCollector makes sense, because of its TopDocs output.
    * Decoupling score from collect().
    I will post a patch a bit later, as this is expected to be a very large patch. I will split it into 2: (1) code patch (2) test cases (moving to use Collector instead of HitCollector, as well as testing the new topDocs(start, howMany) method.
    There might be even a 3rd patch which handles the setScorer thing in Collector (maybe even a different issue?)
    --
    This message is automatically generated by JIRA.
    -
    You can reply to this email to add a comment to the issue online.


    ---------------------------------------------------------------------
    To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
    For additional commands, e-mail: java-dev-help@lucene.apache.org
  • Shai Erera (JIRA) at Mar 30, 2009 at 12:07 pm
    [ https://issues.apache.org/jira/browse/LUCENE-1575?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12693743#action_12693743 ]

    Shai Erera commented on LUCENE-1575:
    ------------------------------------

    Ok I now understand better where score is used in TopFieldCollector ... It is used in a number of places, two important are:
    # Maintain maxScore for returning in TopFieldDocs.
    # Passed to FieldComparator.compareBottom which takes a doc and score parameters. There is one comparator RelevanceComparator which makes use of the score passed, however that's part of the method signature.
    # Passed to FieldComparator.copy - again used by RelevanceComparator only.
    # Passed to updateBottom, which updates the score of the least element in the queue and then calls adjustTop().

    * Number 2, 3 and 4 can be resolved by adding a setScorer to FieldComparator (as empty implementation) which TopFieldCollector will call in each collect() call, passing the Scorer that was given to it in its setScorer.
    * Then, we override that in RelevanceComparator, saving the Scorer and using it whenever the score is needed. Of course we'll need to save the current score, so that we don't call score() too many times for the same document.
    * This eliminates the need to define on TopFieldCollector whether scores should be saved. The reason is that the Sort parameter may include a SortField.SCORE field, which will invoke the RelevanceComparator.

    The question is what to do with maxScore? It is needed for TopDocs / TopFieldDocs. It may also be important to know the maxScore of a query, even if you sort it by something which is not a score.

    Question is - if the steps above make sense, why should we do them at all? :)
    Now the score is computed and passed on to every FieldComparator we received in Sort. Cleaning the method signature means additional code overhead in RelevanceComparator. If we want to compute maxScore as well, it means the score will be computed twice, once in collect() and once in RelevanceComparator.

    We can solve the double score() computation by using an internal ScoreCacheScorer which keeps the score of the current document and returns it whenever score() is called, unless it's a new document and then it delegates the call to the wrapped Scorer. TopFieldCollector can instantiate it in setScorer.

    But this looks quite a lot for cleaning a method signature, don't you think? Of course if you can suggest how we somehow remove the maxScore computation, then it might be a good change, since only if SortField.SCORE is used, will the score be computed.
    Refactoring Lucene collectors (HitCollector and extensions)
    -----------------------------------------------------------

    Key: LUCENE-1575
    URL: https://issues.apache.org/jira/browse/LUCENE-1575
    Project: Lucene - Java
    Issue Type: Improvement
    Components: Search
    Reporter: Shai Erera
    Fix For: 2.9


    This issue is a result of a recent discussion we've had on the mailing list. You can read the thread [here|http://www.nabble.com/Is-TopDocCollector%27s-collect()-implementation-correct--td22557419.html].
    We have agreed to do the following refactoring:
    * Rename MultiReaderHitCollector to Collector, with the purpose that it will be the base class for all Collector implementations.
    * Deprecate HitCollector in favor of the new Collector.
    * Introduce new methods in IndexSearcher that accept Collector, and deprecate those that accept HitCollector.
    ** Create a final class HitCollectorWrapper, and use it in the deprecated methods in IndexSearcher, wrapping the given HitCollector.
    ** HitCollectorWrapper will be marked deprecated, so we can remove it in 3.0, when we remove HitCollector.
    ** It will remove any instanceof checks that currently exist in IndexSearcher code.
    * Create a new (abstract) TopDocsCollector, which will:
    ** Leave collect and setNextReader unimplemented.
    ** Introduce protected members PriorityQueue and totalHits.
    ** Introduce a single protected constructor which accepts a PriorityQueue.
    ** Implement topDocs() and getTotalHits() using the PQ and totalHits members. These can be used as-are by extending classes, as well as be overridden.
    ** Introduce a new topDocs(start, howMany) method which will be used a convenience method when implementing a search application which allows paging through search results. It will also attempt to improve the memory allocation, by allocating a ScoreDoc[] of the requested size only.
    * Change TopScoreDocCollector to extend TopDocsCollector, use the topDocs() and getTotalHits() implementations as they are from TopDocsCollector. The class will also be made final.
    * Change TopFieldCollector to extend TopDocsCollector, and make the class final. Also implement topDocs(start, howMany).
    * Change TopFieldDocCollector (deprecated) to extend TopDocsCollector, instead of TopScoreDocCollector. Implement topDocs(start, howMany)
    * Review other places where HitCollector is used, such as in Scorer, deprecate those places and use Collector instead.
    Additionally, the following proposal was made w.r.t. decoupling score from collect():
    * Change collect to accecpt only a doc Id (unbased).
    * Introduce a setScorer(Scorer) method.
    * If during collect the implementation needs the score, it can call scorer.score().
    If we do this, then we need to review all places in the code where collect(doc, score) is called, and assert whether Scorer can be passed. Also this raises few questions:
    * What if during collect() Scorer is null? (i.e., not set) - is it even possible?
    * I noticed that many (if not all) of the collect() implementations discard the document if its score is not greater than 0. Doesn't it mean that score is needed in collect() always?
    Open issues:
    * The name for Collector
    * TopDocsCollector was mentioned on the thread as TopResultsCollector, but that was when we thought to call Colletor ResultsColletor. Since we decided (so far) on Collector, I think TopDocsCollector makes sense, because of its TopDocs output.
    * Decoupling score from collect().
    I will post a patch a bit later, as this is expected to be a very large patch. I will split it into 2: (1) code patch (2) test cases (moving to use Collector instead of HitCollector, as well as testing the new topDocs(start, howMany) method.
    There might be even a 3rd patch which handles the setScorer thing in Collector (maybe even a different issue?)
    --
    This message is automatically generated by JIRA.
    -
    You can reply to this email to add a comment to the issue online.


    ---------------------------------------------------------------------
    To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
    For additional commands, e-mail: java-dev-help@lucene.apache.org
  • Michael McCandless (JIRA) at Mar 30, 2009 at 2:13 pm
    [ https://issues.apache.org/jira/browse/LUCENE-1575?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12693778#action_12693778 ]

    Michael McCandless commented on LUCENE-1575:
    --------------------------------------------

    bq. The question is what to do with maxScore? It is needed for TopDocs / TopFieldDocs.

    This is why I was thinking you'd have to tell TopFieldCollector whether or not it should track scores. Furthermore, even if SortField.SCORE is not in your SortFields, an app may still want the scores to be enrolled in the TopFieldDocs, for presentation.

    Turning off scoring in TopFieldCollector's ctor just means 1) TopFieldCollector won't track max score, and 2) TopFieldCollector will leave score at 0 in the returned ScoreDoc array.

    bq. Number 2, 3 and 4 can be resolved by adding a setScorer to FieldComparator (as empty implementation) which TopFieldCollector will call in each collect() call, passing the Scorer that was given to it in its setScorer.

    +1

    It makes sense to push the same improvement (not always passing a score; instead, you ask the scorer for score if you need it) down into the FieldCollector API.

    bq. We can solve the double score() computation by using an internal ScoreCacheScorer which keeps the score of the current document and returns it whenever score() is called, unless it's a new document and then it delegates the call to the wrapped Scorer. TopFieldCollector can instantiate it in setScorer.

    +1

    Refactoring Lucene collectors (HitCollector and extensions)
    -----------------------------------------------------------

    Key: LUCENE-1575
    URL: https://issues.apache.org/jira/browse/LUCENE-1575
    Project: Lucene - Java
    Issue Type: Improvement
    Components: Search
    Reporter: Shai Erera
    Fix For: 2.9


    This issue is a result of a recent discussion we've had on the mailing list. You can read the thread [here|http://www.nabble.com/Is-TopDocCollector%27s-collect()-implementation-correct--td22557419.html].
    We have agreed to do the following refactoring:
    * Rename MultiReaderHitCollector to Collector, with the purpose that it will be the base class for all Collector implementations.
    * Deprecate HitCollector in favor of the new Collector.
    * Introduce new methods in IndexSearcher that accept Collector, and deprecate those that accept HitCollector.
    ** Create a final class HitCollectorWrapper, and use it in the deprecated methods in IndexSearcher, wrapping the given HitCollector.
    ** HitCollectorWrapper will be marked deprecated, so we can remove it in 3.0, when we remove HitCollector.
    ** It will remove any instanceof checks that currently exist in IndexSearcher code.
    * Create a new (abstract) TopDocsCollector, which will:
    ** Leave collect and setNextReader unimplemented.
    ** Introduce protected members PriorityQueue and totalHits.
    ** Introduce a single protected constructor which accepts a PriorityQueue.
    ** Implement topDocs() and getTotalHits() using the PQ and totalHits members. These can be used as-are by extending classes, as well as be overridden.
    ** Introduce a new topDocs(start, howMany) method which will be used a convenience method when implementing a search application which allows paging through search results. It will also attempt to improve the memory allocation, by allocating a ScoreDoc[] of the requested size only.
    * Change TopScoreDocCollector to extend TopDocsCollector, use the topDocs() and getTotalHits() implementations as they are from TopDocsCollector. The class will also be made final.
    * Change TopFieldCollector to extend TopDocsCollector, and make the class final. Also implement topDocs(start, howMany).
    * Change TopFieldDocCollector (deprecated) to extend TopDocsCollector, instead of TopScoreDocCollector. Implement topDocs(start, howMany)
    * Review other places where HitCollector is used, such as in Scorer, deprecate those places and use Collector instead.
    Additionally, the following proposal was made w.r.t. decoupling score from collect():
    * Change collect to accecpt only a doc Id (unbased).
    * Introduce a setScorer(Scorer) method.
    * If during collect the implementation needs the score, it can call scorer.score().
    If we do this, then we need to review all places in the code where collect(doc, score) is called, and assert whether Scorer can be passed. Also this raises few questions:
    * What if during collect() Scorer is null? (i.e., not set) - is it even possible?
    * I noticed that many (if not all) of the collect() implementations discard the document if its score is not greater than 0. Doesn't it mean that score is needed in collect() always?
    Open issues:
    * The name for Collector
    * TopDocsCollector was mentioned on the thread as TopResultsCollector, but that was when we thought to call Colletor ResultsColletor. Since we decided (so far) on Collector, I think TopDocsCollector makes sense, because of its TopDocs output.
    * Decoupling score from collect().
    I will post a patch a bit later, as this is expected to be a very large patch. I will split it into 2: (1) code patch (2) test cases (moving to use Collector instead of HitCollector, as well as testing the new topDocs(start, howMany) method.
    There might be even a 3rd patch which handles the setScorer thing in Collector (maybe even a different issue?)
    --
    This message is automatically generated by JIRA.
    -
    You can reply to this email to add a comment to the issue online.


    ---------------------------------------------------------------------
    To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
    For additional commands, e-mail: java-dev-help@lucene.apache.org
  • Shai Erera (JIRA) at Mar 30, 2009 at 2:31 pm
    [ https://issues.apache.org/jira/browse/LUCENE-1575?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12693787#action_12693787 ]

    Shai Erera commented on LUCENE-1575:
    ------------------------------------

    bq. Turning off scoring in TopFieldCollector's ctor just means 1) TopFieldCollector won't track max score, and 2) TopFieldCollector will leave score at 0 in the returned ScoreDoc array.

    Just to be clear - TopDocs as well as TopFieldDocs require a maxScore parameter in their ctor. So are you suggesting to pass something like Float.NaN as maxScore if scoring is turned off? Or introducing a new ctor which does not require maxScore, and defaults to Float.NaN? (or both?)

    bq. Furthermore, even if SortField.SCORE is not in your SortFields, an app may still want the scores to be enrolled in the TopFieldDocs, for presentation.

    Right - we should separate between getting score out of the FieldComparator API and tracking scores in TFC. If I don't have SortField.SCORE in my list of sort fields, then scorer.score() will not be called at all from the FieldComparators layer.

    Tracking scores in TFC is what I'm having troubles with. Turning it off does not necessarily improve anything .. Might be, and might not. In setScorer() I'd still need to register Scorer for passing on to FieldComparator. In collect() I'd still need to check whether score tracking is on, and if so, call scorer.score() and track maxScore. Note that if ScoreCacheScorer is used, then calling scorer.score() in collect does not have too much overhead.

    Also, what will be the default from a Lucene perspective? true - i.e., always keep track of scores?
    Refactoring Lucene collectors (HitCollector and extensions)
    -----------------------------------------------------------

    Key: LUCENE-1575
    URL: https://issues.apache.org/jira/browse/LUCENE-1575
    Project: Lucene - Java
    Issue Type: Improvement
    Components: Search
    Reporter: Shai Erera
    Fix For: 2.9


    This issue is a result of a recent discussion we've had on the mailing list. You can read the thread [here|http://www.nabble.com/Is-TopDocCollector%27s-collect()-implementation-correct--td22557419.html].
    We have agreed to do the following refactoring:
    * Rename MultiReaderHitCollector to Collector, with the purpose that it will be the base class for all Collector implementations.
    * Deprecate HitCollector in favor of the new Collector.
    * Introduce new methods in IndexSearcher that accept Collector, and deprecate those that accept HitCollector.
    ** Create a final class HitCollectorWrapper, and use it in the deprecated methods in IndexSearcher, wrapping the given HitCollector.
    ** HitCollectorWrapper will be marked deprecated, so we can remove it in 3.0, when we remove HitCollector.
    ** It will remove any instanceof checks that currently exist in IndexSearcher code.
    * Create a new (abstract) TopDocsCollector, which will:
    ** Leave collect and setNextReader unimplemented.
    ** Introduce protected members PriorityQueue and totalHits.
    ** Introduce a single protected constructor which accepts a PriorityQueue.
    ** Implement topDocs() and getTotalHits() using the PQ and totalHits members. These can be used as-are by extending classes, as well as be overridden.
    ** Introduce a new topDocs(start, howMany) method which will be used a convenience method when implementing a search application which allows paging through search results. It will also attempt to improve the memory allocation, by allocating a ScoreDoc[] of the requested size only.
    * Change TopScoreDocCollector to extend TopDocsCollector, use the topDocs() and getTotalHits() implementations as they are from TopDocsCollector. The class will also be made final.
    * Change TopFieldCollector to extend TopDocsCollector, and make the class final. Also implement topDocs(start, howMany).
    * Change TopFieldDocCollector (deprecated) to extend TopDocsCollector, instead of TopScoreDocCollector. Implement topDocs(start, howMany)
    * Review other places where HitCollector is used, such as in Scorer, deprecate those places and use Collector instead.
    Additionally, the following proposal was made w.r.t. decoupling score from collect():
    * Change collect to accecpt only a doc Id (unbased).
    * Introduce a setScorer(Scorer) method.
    * If during collect the implementation needs the score, it can call scorer.score().
    If we do this, then we need to review all places in the code where collect(doc, score) is called, and assert whether Scorer can be passed. Also this raises few questions:
    * What if during collect() Scorer is null? (i.e., not set) - is it even possible?
    * I noticed that many (if not all) of the collect() implementations discard the document if its score is not greater than 0. Doesn't it mean that score is needed in collect() always?
    Open issues:
    * The name for Collector
    * TopDocsCollector was mentioned on the thread as TopResultsCollector, but that was when we thought to call Colletor ResultsColletor. Since we decided (so far) on Collector, I think TopDocsCollector makes sense, because of its TopDocs output.
    * Decoupling score from collect().
    I will post a patch a bit later, as this is expected to be a very large patch. I will split it into 2: (1) code patch (2) test cases (moving to use Collector instead of HitCollector, as well as testing the new topDocs(start, howMany) method.
    There might be even a 3rd patch which handles the setScorer thing in Collector (maybe even a different issue?)
    --
    This message is automatically generated by JIRA.
    -
    You can reply to this email to add a comment to the issue online.


    ---------------------------------------------------------------------
    To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
    For additional commands, e-mail: java-dev-help@lucene.apache.org
  • Michael McCandless (JIRA) at Mar 30, 2009 at 3:05 pm
    [ https://issues.apache.org/jira/browse/LUCENE-1575?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12693796#action_12693796 ]

    Michael McCandless commented on LUCENE-1575:
    --------------------------------------------

    bq. Or introducing a new ctor which does not require maxScore, and defaults to Float.NaN?

    I think that's a good approach, though for TopDocs the ctor should be package private I think (only called from TopFieldDocs' ctor)? And the javadocs should clearly spell out that this could happen (so people don't get scared on seeing Float.NaN coming back).

    bq. Turning it off does not necessarily improve anything .. Might be, and might not. In setScorer() I'd still need to register Scorer for passing on to FieldComparator. In collect() I'd still need to check whether score tracking is on, and if so, call scorer.score() and track maxScore. Note that if ScoreCacheScorer is used, then calling scorer.score() in collect does not have too much overhead.

    I think this is an improvement? (Scorer.score() will not have been called... that's the goal here).

    I guess we could also consider making a separate TopFieldCollector (NonScoringTopFieldCollector or some such), instead of sprinkling if statements all over the place.

    bq. Also, what will be the default from a Lucene perspective? true - i.e., always keep track of scores?

    Good question... we have the freedom to choose. Perhaps default to off? But say clearly in the migration javadocs that you have to set that to true to get same behavior as TSDC?
    Refactoring Lucene collectors (HitCollector and extensions)
    -----------------------------------------------------------

    Key: LUCENE-1575
    URL: https://issues.apache.org/jira/browse/LUCENE-1575
    Project: Lucene - Java
    Issue Type: Improvement
    Components: Search
    Reporter: Shai Erera
    Fix For: 2.9


    This issue is a result of a recent discussion we've had on the mailing list. You can read the thread [here|http://www.nabble.com/Is-TopDocCollector%27s-collect()-implementation-correct--td22557419.html].
    We have agreed to do the following refactoring:
    * Rename MultiReaderHitCollector to Collector, with the purpose that it will be the base class for all Collector implementations.
    * Deprecate HitCollector in favor of the new Collector.
    * Introduce new methods in IndexSearcher that accept Collector, and deprecate those that accept HitCollector.
    ** Create a final class HitCollectorWrapper, and use it in the deprecated methods in IndexSearcher, wrapping the given HitCollector.
    ** HitCollectorWrapper will be marked deprecated, so we can remove it in 3.0, when we remove HitCollector.
    ** It will remove any instanceof checks that currently exist in IndexSearcher code.
    * Create a new (abstract) TopDocsCollector, which will:
    ** Leave collect and setNextReader unimplemented.
    ** Introduce protected members PriorityQueue and totalHits.
    ** Introduce a single protected constructor which accepts a PriorityQueue.
    ** Implement topDocs() and getTotalHits() using the PQ and totalHits members. These can be used as-are by extending classes, as well as be overridden.
    ** Introduce a new topDocs(start, howMany) method which will be used a convenience method when implementing a search application which allows paging through search results. It will also attempt to improve the memory allocation, by allocating a ScoreDoc[] of the requested size only.
    * Change TopScoreDocCollector to extend TopDocsCollector, use the topDocs() and getTotalHits() implementations as they are from TopDocsCollector. The class will also be made final.
    * Change TopFieldCollector to extend TopDocsCollector, and make the class final. Also implement topDocs(start, howMany).
    * Change TopFieldDocCollector (deprecated) to extend TopDocsCollector, instead of TopScoreDocCollector. Implement topDocs(start, howMany)
    * Review other places where HitCollector is used, such as in Scorer, deprecate those places and use Collector instead.
    Additionally, the following proposal was made w.r.t. decoupling score from collect():
    * Change collect to accecpt only a doc Id (unbased).
    * Introduce a setScorer(Scorer) method.
    * If during collect the implementation needs the score, it can call scorer.score().
    If we do this, then we need to review all places in the code where collect(doc, score) is called, and assert whether Scorer can be passed. Also this raises few questions:
    * What if during collect() Scorer is null? (i.e., not set) - is it even possible?
    * I noticed that many (if not all) of the collect() implementations discard the document if its score is not greater than 0. Doesn't it mean that score is needed in collect() always?
    Open issues:
    * The name for Collector
    * TopDocsCollector was mentioned on the thread as TopResultsCollector, but that was when we thought to call Colletor ResultsColletor. Since we decided (so far) on Collector, I think TopDocsCollector makes sense, because of its TopDocs output.
    * Decoupling score from collect().
    I will post a patch a bit later, as this is expected to be a very large patch. I will split it into 2: (1) code patch (2) test cases (moving to use Collector instead of HitCollector, as well as testing the new topDocs(start, howMany) method.
    There might be even a 3rd patch which handles the setScorer thing in Collector (maybe even a different issue?)
    --
    This message is automatically generated by JIRA.
    -
    You can reply to this email to add a comment to the issue online.


    ---------------------------------------------------------------------
    To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
    For additional commands, e-mail: java-dev-help@lucene.apache.org
  • Shai Erera (JIRA) at Mar 30, 2009 at 3:53 pm
    [ https://issues.apache.org/jira/browse/LUCENE-1575?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12693812#action_12693812 ]

    Shai Erera commented on LUCENE-1575:
    ------------------------------------

    ok I'll add another package-private ctor to TopDocs which does not get maxScore and defaults to NaN, as well as update the javadocs. No back-compat here since the only code that will use it is TFC, which is new.

    bq. I think this is an improvement? (Scorer.score() will not have been called... that's the goal here).

    Well, if we use ScoreCacheScorer, then this call is really fast, returning immediately and w/o computing the score.

    bq. I guess we could also consider making a separate TopFieldCollector (NonScoringTopFieldCollector or some such), instead of sprinkling if statements all over the place.

    I always like such approaches. How's that sound:
    # Create a base NSTFC, which has a protected score member, initialized to 0, and is exactly like the current TFC, only w/o maxScore.
    # Have TFC extend NSTFC, override collect() and:
    ## Set super.score = scorer.score(). That is required for updateBottom which updates the score on the ScoreDoc in pq.
    ## Compute maxScore.
    ## Call super.collect().
    ## Override topDocs(start, howMany) to provide one with maxScore.

    bq. Good question... we have the freedom to choose. Perhaps default to off? But say clearly in the migration javadocs that you have to set that to true to get same behavior as TSDC?

    So you suggest the methods on IndexSearcher today that take a Sort as parameter will default to NSTFC? As long as we document it it's ok? Are all of these new?
    Refactoring Lucene collectors (HitCollector and extensions)
    -----------------------------------------------------------

    Key: LUCENE-1575
    URL: https://issues.apache.org/jira/browse/LUCENE-1575
    Project: Lucene - Java
    Issue Type: Improvement
    Components: Search
    Reporter: Shai Erera
    Fix For: 2.9


    This issue is a result of a recent discussion we've had on the mailing list. You can read the thread [here|http://www.nabble.com/Is-TopDocCollector%27s-collect()-implementation-correct--td22557419.html].
    We have agreed to do the following refactoring:
    * Rename MultiReaderHitCollector to Collector, with the purpose that it will be the base class for all Collector implementations.
    * Deprecate HitCollector in favor of the new Collector.
    * Introduce new methods in IndexSearcher that accept Collector, and deprecate those that accept HitCollector.
    ** Create a final class HitCollectorWrapper, and use it in the deprecated methods in IndexSearcher, wrapping the given HitCollector.
    ** HitCollectorWrapper will be marked deprecated, so we can remove it in 3.0, when we remove HitCollector.
    ** It will remove any instanceof checks that currently exist in IndexSearcher code.
    * Create a new (abstract) TopDocsCollector, which will:
    ** Leave collect and setNextReader unimplemented.
    ** Introduce protected members PriorityQueue and totalHits.
    ** Introduce a single protected constructor which accepts a PriorityQueue.
    ** Implement topDocs() and getTotalHits() using the PQ and totalHits members. These can be used as-are by extending classes, as well as be overridden.
    ** Introduce a new topDocs(start, howMany) method which will be used a convenience method when implementing a search application which allows paging through search results. It will also attempt to improve the memory allocation, by allocating a ScoreDoc[] of the requested size only.
    * Change TopScoreDocCollector to extend TopDocsCollector, use the topDocs() and getTotalHits() implementations as they are from TopDocsCollector. The class will also be made final.
    * Change TopFieldCollector to extend TopDocsCollector, and make the class final. Also implement topDocs(start, howMany).
    * Change TopFieldDocCollector (deprecated) to extend TopDocsCollector, instead of TopScoreDocCollector. Implement topDocs(start, howMany)
    * Review other places where HitCollector is used, such as in Scorer, deprecate those places and use Collector instead.
    Additionally, the following proposal was made w.r.t. decoupling score from collect():
    * Change collect to accecpt only a doc Id (unbased).
    * Introduce a setScorer(Scorer) method.
    * If during collect the implementation needs the score, it can call scorer.score().
    If we do this, then we need to review all places in the code where collect(doc, score) is called, and assert whether Scorer can be passed. Also this raises few questions:
    * What if during collect() Scorer is null? (i.e., not set) - is it even possible?
    * I noticed that many (if not all) of the collect() implementations discard the document if its score is not greater than 0. Doesn't it mean that score is needed in collect() always?
    Open issues:
    * The name for Collector
    * TopDocsCollector was mentioned on the thread as TopResultsCollector, but that was when we thought to call Colletor ResultsColletor. Since we decided (so far) on Collector, I think TopDocsCollector makes sense, because of its TopDocs output.
    * Decoupling score from collect().
    I will post a patch a bit later, as this is expected to be a very large patch. I will split it into 2: (1) code patch (2) test cases (moving to use Collector instead of HitCollector, as well as testing the new topDocs(start, howMany) method.
    There might be even a 3rd patch which handles the setScorer thing in Collector (maybe even a different issue?)
    --
    This message is automatically generated by JIRA.
    -
    You can reply to this email to add a comment to the issue online.


    ---------------------------------------------------------------------
    To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
    For additional commands, e-mail: java-dev-help@lucene.apache.org
  • Michael McCandless (JIRA) at Mar 30, 2009 at 4:21 pm
    [ https://issues.apache.org/jira/browse/LUCENE-1575?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12693822#action_12693822 ]

    Michael McCandless commented on LUCENE-1575:
    --------------------------------------------

    bq. How's that sound:

    That sounds good! So to be consistent maybe we create ScoringTopFieldCollector and NonScoringTopFieldCollector?

    This means we don't need ScoreCacheScorer? (because ScoringTopFieldCollector will always grab the score). Though how do we change FieldComparator API so as to not pass score around? All comparators except RelevanceComparator don't use it.

    bq. Well, if we use ScoreCacheScorer, then this call is really fast, returning immediately and w/o computing the score.

    I'm actually torn on how fast this will be: I think that will be an if statement that's hard for the CPU to predict, which is costly.

    bq. So you suggest the methods on IndexSearcher today that take a Sort as parameter will default to NSTFC? As long as we document it it's ok? Are all of these new?

    Hmmm... actually, no, I think those must continue to use NSTFC for the existing methods (to remain back compatible), but add a new search method that takes a boolean trackScore?
    Refactoring Lucene collectors (HitCollector and extensions)
    -----------------------------------------------------------

    Key: LUCENE-1575
    URL: https://issues.apache.org/jira/browse/LUCENE-1575
    Project: Lucene - Java
    Issue Type: Improvement
    Components: Search
    Reporter: Shai Erera
    Fix For: 2.9


    This issue is a result of a recent discussion we've had on the mailing list. You can read the thread [here|http://www.nabble.com/Is-TopDocCollector%27s-collect()-implementation-correct--td22557419.html].
    We have agreed to do the following refactoring:
    * Rename MultiReaderHitCollector to Collector, with the purpose that it will be the base class for all Collector implementations.
    * Deprecate HitCollector in favor of the new Collector.
    * Introduce new methods in IndexSearcher that accept Collector, and deprecate those that accept HitCollector.
    ** Create a final class HitCollectorWrapper, and use it in the deprecated methods in IndexSearcher, wrapping the given HitCollector.
    ** HitCollectorWrapper will be marked deprecated, so we can remove it in 3.0, when we remove HitCollector.
    ** It will remove any instanceof checks that currently exist in IndexSearcher code.
    * Create a new (abstract) TopDocsCollector, which will:
    ** Leave collect and setNextReader unimplemented.
    ** Introduce protected members PriorityQueue and totalHits.
    ** Introduce a single protected constructor which accepts a PriorityQueue.
    ** Implement topDocs() and getTotalHits() using the PQ and totalHits members. These can be used as-are by extending classes, as well as be overridden.
    ** Introduce a new topDocs(start, howMany) method which will be used a convenience method when implementing a search application which allows paging through search results. It will also attempt to improve the memory allocation, by allocating a ScoreDoc[] of the requested size only.
    * Change TopScoreDocCollector to extend TopDocsCollector, use the topDocs() and getTotalHits() implementations as they are from TopDocsCollector. The class will also be made final.
    * Change TopFieldCollector to extend TopDocsCollector, and make the class final. Also implement topDocs(start, howMany).
    * Change TopFieldDocCollector (deprecated) to extend TopDocsCollector, instead of TopScoreDocCollector. Implement topDocs(start, howMany)
    * Review other places where HitCollector is used, such as in Scorer, deprecate those places and use Collector instead.
    Additionally, the following proposal was made w.r.t. decoupling score from collect():
    * Change collect to accecpt only a doc Id (unbased).
    * Introduce a setScorer(Scorer) method.
    * If during collect the implementation needs the score, it can call scorer.score().
    If we do this, then we need to review all places in the code where collect(doc, score) is called, and assert whether Scorer can be passed. Also this raises few questions:
    * What if during collect() Scorer is null? (i.e., not set) - is it even possible?
    * I noticed that many (if not all) of the collect() implementations discard the document if its score is not greater than 0. Doesn't it mean that score is needed in collect() always?
    Open issues:
    * The name for Collector
    * TopDocsCollector was mentioned on the thread as TopResultsCollector, but that was when we thought to call Colletor ResultsColletor. Since we decided (so far) on Collector, I think TopDocsCollector makes sense, because of its TopDocs output.
    * Decoupling score from collect().
    I will post a patch a bit later, as this is expected to be a very large patch. I will split it into 2: (1) code patch (2) test cases (moving to use Collector instead of HitCollector, as well as testing the new topDocs(start, howMany) method.
    There might be even a 3rd patch which handles the setScorer thing in Collector (maybe even a different issue?)
    --
    This message is automatically generated by JIRA.
    -
    You can reply to this email to add a comment to the issue online.


    ---------------------------------------------------------------------
    To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
    For additional commands, e-mail: java-dev-help@lucene.apache.org
  • Shai Erera (JIRA) at Mar 30, 2009 at 5:01 pm
    [ https://issues.apache.org/jira/browse/LUCENE-1575?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12693843#action_12693843 ]

    Shai Erera commented on LUCENE-1575:
    ------------------------------------

    bq. So to be consistent maybe we create ScoringTopFieldCollector and NonScoringTopFieldCollector?

    And have STFC extend NSTFC? I see no reason to create an abstract TopFieldCollector.

    bq. This means we don't need ScoreCacheScorer? (because ScoringTopFieldCollector will always grab the score). Though how do we change FieldComparator API so as to not pass score around? All comparators except RelevanceComparator don't use it.

    I was actually thinking of that class for RelevanceComparator. So perhaps I can implement the logic inside RelevanceComparator? Although this sounds like a nice utility class, now that we have a setScorer on Collector - others may find it useful too.
    Remember that score-tracking is done for maxScore and ScoreDoc purposes (inside STFC). The score in the FieldComparator API is used only in RelevanceComparator, whether it's STFC or NSTFC.

    bq. I think those must continue to use NSTFC for the existing methods (to remain back compatible)

    Did you mean continue to use STFC? The current behavior is that scoring is tracked, I think.

    bq. add a new search method that takes a boolean trackScore?

    I actually prefer not to expose any more methods. IndexSearcher already has plenty of them. Instead, one can use the very generic, simple and useful method search(Query, Collector) and pass in a NSTFC instance. Otherwise we'll end up adding many search() methods to IndexSearcher, if we continue with that approach going forward.
    Refactoring Lucene collectors (HitCollector and extensions)
    -----------------------------------------------------------

    Key: LUCENE-1575
    URL: https://issues.apache.org/jira/browse/LUCENE-1575
    Project: Lucene - Java
    Issue Type: Improvement
    Components: Search
    Reporter: Shai Erera
    Fix For: 2.9


    This issue is a result of a recent discussion we've had on the mailing list. You can read the thread [here|http://www.nabble.com/Is-TopDocCollector%27s-collect()-implementation-correct--td22557419.html].
    We have agreed to do the following refactoring:
    * Rename MultiReaderHitCollector to Collector, with the purpose that it will be the base class for all Collector implementations.
    * Deprecate HitCollector in favor of the new Collector.
    * Introduce new methods in IndexSearcher that accept Collector, and deprecate those that accept HitCollector.
    ** Create a final class HitCollectorWrapper, and use it in the deprecated methods in IndexSearcher, wrapping the given HitCollector.
    ** HitCollectorWrapper will be marked deprecated, so we can remove it in 3.0, when we remove HitCollector.
    ** It will remove any instanceof checks that currently exist in IndexSearcher code.
    * Create a new (abstract) TopDocsCollector, which will:
    ** Leave collect and setNextReader unimplemented.
    ** Introduce protected members PriorityQueue and totalHits.
    ** Introduce a single protected constructor which accepts a PriorityQueue.
    ** Implement topDocs() and getTotalHits() using the PQ and totalHits members. These can be used as-are by extending classes, as well as be overridden.
    ** Introduce a new topDocs(start, howMany) method which will be used a convenience method when implementing a search application which allows paging through search results. It will also attempt to improve the memory allocation, by allocating a ScoreDoc[] of the requested size only.
    * Change TopScoreDocCollector to extend TopDocsCollector, use the topDocs() and getTotalHits() implementations as they are from TopDocsCollector. The class will also be made final.
    * Change TopFieldCollector to extend TopDocsCollector, and make the class final. Also implement topDocs(start, howMany).
    * Change TopFieldDocCollector (deprecated) to extend TopDocsCollector, instead of TopScoreDocCollector. Implement topDocs(start, howMany)
    * Review other places where HitCollector is used, such as in Scorer, deprecate those places and use Collector instead.
    Additionally, the following proposal was made w.r.t. decoupling score from collect():
    * Change collect to accecpt only a doc Id (unbased).
    * Introduce a setScorer(Scorer) method.
    * If during collect the implementation needs the score, it can call scorer.score().
    If we do this, then we need to review all places in the code where collect(doc, score) is called, and assert whether Scorer can be passed. Also this raises few questions:
    * What if during collect() Scorer is null? (i.e., not set) - is it even possible?
    * I noticed that many (if not all) of the collect() implementations discard the document if its score is not greater than 0. Doesn't it mean that score is needed in collect() always?
    Open issues:
    * The name for Collector
    * TopDocsCollector was mentioned on the thread as TopResultsCollector, but that was when we thought to call Colletor ResultsColletor. Since we decided (so far) on Collector, I think TopDocsCollector makes sense, because of its TopDocs output.
    * Decoupling score from collect().
    I will post a patch a bit later, as this is expected to be a very large patch. I will split it into 2: (1) code patch (2) test cases (moving to use Collector instead of HitCollector, as well as testing the new topDocs(start, howMany) method.
    There might be even a 3rd patch which handles the setScorer thing in Collector (maybe even a different issue?)
    --
    This message is automatically generated by JIRA.
    -
    You can reply to this email to add a comment to the issue online.


    ---------------------------------------------------------------------
    To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
    For additional commands, e-mail: java-dev-help@lucene.apache.org
  • Michael McCandless (JIRA) at Mar 30, 2009 at 5:17 pm
    [ https://issues.apache.org/jira/browse/LUCENE-1575?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12693847#action_12693847 ]

    Michael McCandless commented on LUCENE-1575:
    --------------------------------------------

    bq. And have STFC extend NSTFC? I see no reason to create an abstract TopFieldCollector.

    Yes.

    bq. Although this sounds like a nice utility class, now that we have a setScorer on Collector - others may find it useful too.

    OK I agree, it would be useful to have for general usage (eg chaining collectors).

    But what is the plan now for the FieldComparator API? We no longer pass score all around, but expose access to scorer, which only RelevanceComparator (in core) will use?

    bq. Did you mean continue to use STFC? The current behavior is that scoring is tracked, I think.

    Sorry, yes, STFC. Beginning to lose mind...

    bq. I actually prefer not to expose any more methods. IndexSearcher already has plenty of them. Instead, one can use the very generic, simple and useful method search(Query, Collector) and pass in a NSTFC instance. Otherwise we'll end up adding many search() methods to IndexSearcher, if we continue with that approach going forward.

    I think that'd be OK... the only thing that bothers me is I think the natural default when sorting by field is to not gather the score. Ie I don't want someone evaluating Lucene in the future to say our field sort is too slow when they didn't realize they had to go use this advanced API that turns off scoring.

    What if we add a new method, and deprecate the old one? This way come 3.0 we will not have added any methods, and then when sorting by field you see that you have to choose with or without scores.
    Refactoring Lucene collectors (HitCollector and extensions)
    -----------------------------------------------------------

    Key: LUCENE-1575
    URL: https://issues.apache.org/jira/browse/LUCENE-1575
    Project: Lucene - Java
    Issue Type: Improvement
    Components: Search
    Reporter: Shai Erera
    Fix For: 2.9


    This issue is a result of a recent discussion we've had on the mailing list. You can read the thread [here|http://www.nabble.com/Is-TopDocCollector%27s-collect()-implementation-correct--td22557419.html].
    We have agreed to do the following refactoring:
    * Rename MultiReaderHitCollector to Collector, with the purpose that it will be the base class for all Collector implementations.
    * Deprecate HitCollector in favor of the new Collector.
    * Introduce new methods in IndexSearcher that accept Collector, and deprecate those that accept HitCollector.
    ** Create a final class HitCollectorWrapper, and use it in the deprecated methods in IndexSearcher, wrapping the given HitCollector.
    ** HitCollectorWrapper will be marked deprecated, so we can remove it in 3.0, when we remove HitCollector.
    ** It will remove any instanceof checks that currently exist in IndexSearcher code.
    * Create a new (abstract) TopDocsCollector, which will:
    ** Leave collect and setNextReader unimplemented.
    ** Introduce protected members PriorityQueue and totalHits.
    ** Introduce a single protected constructor which accepts a PriorityQueue.
    ** Implement topDocs() and getTotalHits() using the PQ and totalHits members. These can be used as-are by extending classes, as well as be overridden.
    ** Introduce a new topDocs(start, howMany) method which will be used a convenience method when implementing a search application which allows paging through search results. It will also attempt to improve the memory allocation, by allocating a ScoreDoc[] of the requested size only.
    * Change TopScoreDocCollector to extend TopDocsCollector, use the topDocs() and getTotalHits() implementations as they are from TopDocsCollector. The class will also be made final.
    * Change TopFieldCollector to extend TopDocsCollector, and make the class final. Also implement topDocs(start, howMany).
    * Change TopFieldDocCollector (deprecated) to extend TopDocsCollector, instead of TopScoreDocCollector. Implement topDocs(start, howMany)
    * Review other places where HitCollector is used, such as in Scorer, deprecate those places and use Collector instead.
    Additionally, the following proposal was made w.r.t. decoupling score from collect():
    * Change collect to accecpt only a doc Id (unbased).
    * Introduce a setScorer(Scorer) method.
    * If during collect the implementation needs the score, it can call scorer.score().
    If we do this, then we need to review all places in the code where collect(doc, score) is called, and assert whether Scorer can be passed. Also this raises few questions:
    * What if during collect() Scorer is null? (i.e., not set) - is it even possible?
    * I noticed that many (if not all) of the collect() implementations discard the document if its score is not greater than 0. Doesn't it mean that score is needed in collect() always?
    Open issues:
    * The name for Collector
    * TopDocsCollector was mentioned on the thread as TopResultsCollector, but that was when we thought to call Colletor ResultsColletor. Since we decided (so far) on Collector, I think TopDocsCollector makes sense, because of its TopDocs output.
    * Decoupling score from collect().
    I will post a patch a bit later, as this is expected to be a very large patch. I will split it into 2: (1) code patch (2) test cases (moving to use Collector instead of HitCollector, as well as testing the new topDocs(start, howMany) method.
    There might be even a 3rd patch which handles the setScorer thing in Collector (maybe even a different issue?)
    --
    This message is automatically generated by JIRA.
    -
    You can reply to this email to add a comment to the issue online.


    ---------------------------------------------------------------------
    To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
    For additional commands, e-mail: java-dev-help@lucene.apache.org
  • Shai Erera (JIRA) at Mar 31, 2009 at 5:58 am
    [ https://issues.apache.org/jira/browse/LUCENE-1575?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12693997#action_12693997 ]

    Shai Erera commented on LUCENE-1575:
    ------------------------------------

    bq. But what is the plan now for the FieldComparator API? We no longer pass score all around, but expose access to scorer, which only RelevanceComparator (in core) will use?

    Yes. FieldComparator will have a default empty setScorer() method, which will be overridden by RelevanceComparator. In TopFieldCollector (forget the final name now) setScorer() method we set the Scorer on all FieldComparators. During collect(), only RelevanceComparator, if it exists, will compute the score.

    bq. What if we add a new method, and deprecate the old one?

    The current methods are:
    * Searchable - search(Weight, Filter, int, Sort)
    * Searcher - search(Query, Filter, int, Sort)
    * IndexSearcher - search(Weight, Filter, int, Sort, boolean /* fillFields */)

    Deprecate all three, and add the same but take another boolean as a parameter? I have two comments regarding that:
    # The current methods need to call the new ones with trackScores = true since that's the current behavior.
    # When we are left with only the new versions, I'm afraid those methods will not look 'simple fast' to a user - I now have to decide whether I want to track scores or not, something I haven't given much thought to before. I kind of like the current signature, but I understand your concern regarding defaults.

    BTW, Searchable is an interface, so we cannot add it there. Searcher is an abstract class and we cannot add the method to it with default implementation (as I believe the other search methods will call the new one with default=true). So it only leaves IndexSearcher as an option. But then what if someone uses MultiSearcher? ParallelMultiSearcher? etc.

    Is it possible to deprecate a method, documenting that its runtime behavior will change in 3.0 and then in 3.0 change to not track scores?

    If we're touching TopFieldCollector in such a way, I'd like to propose the following refactoring. It stems from the current complex implementation in collect() which checks in every collect call if we have just one Comparator or Multi, and we're talking about having two versions w/ and w/o score tracking:
    * Keep TopFieldCollector as abstract class with a static create() factory method, and an abstract updateBottom() method. It will still implement topDocs(start, howMany).
    * Have the factory method create one of 4 instances, all extend TFC, but are private internal classes, which do not concern the user:
    *# OneComparatorNonScoringTopFieldCollector - assumes just one FieldComparator exists as well as scoring should not be tracked.
    *# OneComparatorScoringTopFieldCollector - assumes just one FieldComparator exists as well as scoring should be tracked.
    *# MultiComparatorNonScoringTopFieldCollector - assumes more than one FieldComparator exists, as well as scoring should not be tracked.
    *# MultiComparatorScoringTopFieldCollector - assumes more than one FieldComparator exists, as well as scoring should be tracked.

    The advantages are:
    * We simplify the API - the user is only aware of TopFieldCollector, and instead of calling a ctor it calls a static create method, which takes all the arguments as the ctor takes.
    * We are free to create whatever instance is the right and most optimized one given the input parameters. The user does not care how the instance is called. Hence the long names - they are internal anyway.
    * The code is much cleaner and easy to understand. It also does not need to check if we have just one comparator or more in every call to collect.
    * We don't need to add a protected score to a NonScoring collector (read above about code readability) just because a Scoring one extends it and will make use of it.

    Since TopFieldCollector is new, we have the freedom to do it right w/o deprecating anything. I think it's a much cleaner design. It is orthogonal to the discussion we're having regarding the search methods and parameters. They will use the create() factory method instead of creating a collector, passing whatever arguments they have. So let's not confuse the two.

    The patch for this issue is ready. As soon as we agree on how to proceed with TFC, I'll add the changes and submit the patch.
    Refactoring Lucene collectors (HitCollector and extensions)
    -----------------------------------------------------------

    Key: LUCENE-1575
    URL: https://issues.apache.org/jira/browse/LUCENE-1575
    Project: Lucene - Java
    Issue Type: Improvement
    Components: Search
    Reporter: Shai Erera
    Fix For: 2.9


    This issue is a result of a recent discussion we've had on the mailing list. You can read the thread [here|http://www.nabble.com/Is-TopDocCollector%27s-collect()-implementation-correct--td22557419.html].
    We have agreed to do the following refactoring:
    * Rename MultiReaderHitCollector to Collector, with the purpose that it will be the base class for all Collector implementations.
    * Deprecate HitCollector in favor of the new Collector.
    * Introduce new methods in IndexSearcher that accept Collector, and deprecate those that accept HitCollector.
    ** Create a final class HitCollectorWrapper, and use it in the deprecated methods in IndexSearcher, wrapping the given HitCollector.
    ** HitCollectorWrapper will be marked deprecated, so we can remove it in 3.0, when we remove HitCollector.
    ** It will remove any instanceof checks that currently exist in IndexSearcher code.
    * Create a new (abstract) TopDocsCollector, which will:
    ** Leave collect and setNextReader unimplemented.
    ** Introduce protected members PriorityQueue and totalHits.
    ** Introduce a single protected constructor which accepts a PriorityQueue.
    ** Implement topDocs() and getTotalHits() using the PQ and totalHits members. These can be used as-are by extending classes, as well as be overridden.
    ** Introduce a new topDocs(start, howMany) method which will be used a convenience method when implementing a search application which allows paging through search results. It will also attempt to improve the memory allocation, by allocating a ScoreDoc[] of the requested size only.
    * Change TopScoreDocCollector to extend TopDocsCollector, use the topDocs() and getTotalHits() implementations as they are from TopDocsCollector. The class will also be made final.
    * Change TopFieldCollector to extend TopDocsCollector, and make the class final. Also implement topDocs(start, howMany).
    * Change TopFieldDocCollector (deprecated) to extend TopDocsCollector, instead of TopScoreDocCollector. Implement topDocs(start, howMany)
    * Review other places where HitCollector is used, such as in Scorer, deprecate those places and use Collector instead.
    Additionally, the following proposal was made w.r.t. decoupling score from collect():
    * Change collect to accecpt only a doc Id (unbased).
    * Introduce a setScorer(Scorer) method.
    * If during collect the implementation needs the score, it can call scorer.score().
    If we do this, then we need to review all places in the code where collect(doc, score) is called, and assert whether Scorer can be passed. Also this raises few questions:
    * What if during collect() Scorer is null? (i.e., not set) - is it even possible?
    * I noticed that many (if not all) of the collect() implementations discard the document if its score is not greater than 0. Doesn't it mean that score is needed in collect() always?
    Open issues:
    * The name for Collector
    * TopDocsCollector was mentioned on the thread as TopResultsCollector, but that was when we thought to call Colletor ResultsColletor. Since we decided (so far) on Collector, I think TopDocsCollector makes sense, because of its TopDocs output.
    * Decoupling score from collect().
    I will post a patch a bit later, as this is expected to be a very large patch. I will split it into 2: (1) code patch (2) test cases (moving to use Collector instead of HitCollector, as well as testing the new topDocs(start, howMany) method.
    There might be even a 3rd patch which handles the setScorer thing in Collector (maybe even a different issue?)
    --
    This message is automatically generated by JIRA.
    -
    You can reply to this email to add a comment to the issue online.


    ---------------------------------------------------------------------
    To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
    For additional commands, e-mail: java-dev-help@lucene.apache.org
  • Michael McCandless (JIRA) at Mar 31, 2009 at 11:18 am
    [ https://issues.apache.org/jira/browse/LUCENE-1575?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12694068#action_12694068 ]

    Michael McCandless commented on LUCENE-1575:
    --------------------------------------------

    bq. If we're touching TopFieldCollector in such a way, I'd like to propose the following refactoring

    +1. That looks great.

    bq. When we are left with only the new versions, I'm afraid those methods will not look 'simple fast' to a user

    I agree, which is why I'd like in 3.0 for the default to be "don't score when sorting by fields".

    bq. Is it possible to deprecate a method, documenting that its runtime behavior will change in 3.0 and then in 3.0 change to not track scores?

    I think this may in fact be our best option: don't deprecate the method, but document that in 3.0 this method will no longer do scoring. There is a precedent here: in 3.0, IndexReader.open is going to return readOnly readers by default (vs read/write today). We have also done similar fixes within a minor release, eg fixes to StandardAnalyzer. I think there are other things we should do (eg, StopFilter should enable position increment by default, which it doesn't today -- LUCENE-1258). If we do this approach, on committing this issue you should open a new one w/ fix version 3.0 to switch up the default.

    I think, with 3.0, if we clearly document in CHANGES, as well as on the particular APIs, the changes to Lucene's defaults, that's sufficient?
    Refactoring Lucene collectors (HitCollector and extensions)
    -----------------------------------------------------------

    Key: LUCENE-1575
    URL: https://issues.apache.org/jira/browse/LUCENE-1575
    Project: Lucene - Java
    Issue Type: Improvement
    Components: Search
    Reporter: Shai Erera
    Fix For: 2.9


    This issue is a result of a recent discussion we've had on the mailing list. You can read the thread [here|http://www.nabble.com/Is-TopDocCollector%27s-collect()-implementation-correct--td22557419.html].
    We have agreed to do the following refactoring:
    * Rename MultiReaderHitCollector to Collector, with the purpose that it will be the base class for all Collector implementations.
    * Deprecate HitCollector in favor of the new Collector.
    * Introduce new methods in IndexSearcher that accept Collector, and deprecate those that accept HitCollector.
    ** Create a final class HitCollectorWrapper, and use it in the deprecated methods in IndexSearcher, wrapping the given HitCollector.
    ** HitCollectorWrapper will be marked deprecated, so we can remove it in 3.0, when we remove HitCollector.
    ** It will remove any instanceof checks that currently exist in IndexSearcher code.
    * Create a new (abstract) TopDocsCollector, which will:
    ** Leave collect and setNextReader unimplemented.
    ** Introduce protected members PriorityQueue and totalHits.
    ** Introduce a single protected constructor which accepts a PriorityQueue.
    ** Implement topDocs() and getTotalHits() using the PQ and totalHits members. These can be used as-are by extending classes, as well as be overridden.
    ** Introduce a new topDocs(start, howMany) method which will be used a convenience method when implementing a search application which allows paging through search results. It will also attempt to improve the memory allocation, by allocating a ScoreDoc[] of the requested size only.
    * Change TopScoreDocCollector to extend TopDocsCollector, use the topDocs() and getTotalHits() implementations as they are from TopDocsCollector. The class will also be made final.
    * Change TopFieldCollector to extend TopDocsCollector, and make the class final. Also implement topDocs(start, howMany).
    * Change TopFieldDocCollector (deprecated) to extend TopDocsCollector, instead of TopScoreDocCollector. Implement topDocs(start, howMany)
    * Review other places where HitCollector is used, such as in Scorer, deprecate those places and use Collector instead.
    Additionally, the following proposal was made w.r.t. decoupling score from collect():
    * Change collect to accecpt only a doc Id (unbased).
    * Introduce a setScorer(Scorer) method.
    * If during collect the implementation needs the score, it can call scorer.score().
    If we do this, then we need to review all places in the code where collect(doc, score) is called, and assert whether Scorer can be passed. Also this raises few questions:
    * What if during collect() Scorer is null? (i.e., not set) - is it even possible?
    * I noticed that many (if not all) of the collect() implementations discard the document if its score is not greater than 0. Doesn't it mean that score is needed in collect() always?
    Open issues:
    * The name for Collector
    * TopDocsCollector was mentioned on the thread as TopResultsCollector, but that was when we thought to call Colletor ResultsColletor. Since we decided (so far) on Collector, I think TopDocsCollector makes sense, because of its TopDocs output.
    * Decoupling score from collect().
    I will post a patch a bit later, as this is expected to be a very large patch. I will split it into 2: (1) code patch (2) test cases (moving to use Collector instead of HitCollector, as well as testing the new topDocs(start, howMany) method.
    There might be even a 3rd patch which handles the setScorer thing in Collector (maybe even a different issue?)
    --
    This message is automatically generated by JIRA.
    -
    You can reply to this email to add a comment to the issue online.


    ---------------------------------------------------------------------
    To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
    For additional commands, e-mail: java-dev-help@lucene.apache.org
  • Shai Erera (JIRA) at Mar 31, 2009 at 12:00 pm
    [ https://issues.apache.org/jira/browse/LUCENE-1575?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12694074#action_12694074 ]

    Shai Erera commented on LUCENE-1575:
    ------------------------------------

    bq. If we do this approach, on committing this issue you should open a new one w/ fix version 3.0 to switch up the default.

    Ok then let's do that. I'll add a TODO to these methods that it should be changed in 3.0, and also open another issue. (The TODO is in case I forget to open the issue).

    I'll also add documentation to the CHANGES file as well as the API.

    Ok, I think the patch should be ready soon then. Just need to complete the refactoring to TopFieldCollector.
    Refactoring Lucene collectors (HitCollector and extensions)
    -----------------------------------------------------------

    Key: LUCENE-1575
    URL: https://issues.apache.org/jira/browse/LUCENE-1575
    Project: Lucene - Java
    Issue Type: Improvement
    Components: Search
    Reporter: Shai Erera
    Fix For: 2.9


    This issue is a result of a recent discussion we've had on the mailing list. You can read the thread [here|http://www.nabble.com/Is-TopDocCollector%27s-collect()-implementation-correct--td22557419.html].
    We have agreed to do the following refactoring:
    * Rename MultiReaderHitCollector to Collector, with the purpose that it will be the base class for all Collector implementations.
    * Deprecate HitCollector in favor of the new Collector.
    * Introduce new methods in IndexSearcher that accept Collector, and deprecate those that accept HitCollector.
    ** Create a final class HitCollectorWrapper, and use it in the deprecated methods in IndexSearcher, wrapping the given HitCollector.
    ** HitCollectorWrapper will be marked deprecated, so we can remove it in 3.0, when we remove HitCollector.
    ** It will remove any instanceof checks that currently exist in IndexSearcher code.
    * Create a new (abstract) TopDocsCollector, which will:
    ** Leave collect and setNextReader unimplemented.
    ** Introduce protected members PriorityQueue and totalHits.
    ** Introduce a single protected constructor which accepts a PriorityQueue.
    ** Implement topDocs() and getTotalHits() using the PQ and totalHits members. These can be used as-are by extending classes, as well as be overridden.
    ** Introduce a new topDocs(start, howMany) method which will be used a convenience method when implementing a search application which allows paging through search results. It will also attempt to improve the memory allocation, by allocating a ScoreDoc[] of the requested size only.
    * Change TopScoreDocCollector to extend TopDocsCollector, use the topDocs() and getTotalHits() implementations as they are from TopDocsCollector. The class will also be made final.
    * Change TopFieldCollector to extend TopDocsCollector, and make the class final. Also implement topDocs(start, howMany).
    * Change TopFieldDocCollector (deprecated) to extend TopDocsCollector, instead of TopScoreDocCollector. Implement topDocs(start, howMany)
    * Review other places where HitCollector is used, such as in Scorer, deprecate those places and use Collector instead.
    Additionally, the following proposal was made w.r.t. decoupling score from collect():
    * Change collect to accecpt only a doc Id (unbased).
    * Introduce a setScorer(Scorer) method.
    * If during collect the implementation needs the score, it can call scorer.score().
    If we do this, then we need to review all places in the code where collect(doc, score) is called, and assert whether Scorer can be passed. Also this raises few questions:
    * What if during collect() Scorer is null? (i.e., not set) - is it even possible?
    * I noticed that many (if not all) of the collect() implementations discard the document if its score is not greater than 0. Doesn't it mean that score is needed in collect() always?
    Open issues:
    * The name for Collector
    * TopDocsCollector was mentioned on the thread as TopResultsCollector, but that was when we thought to call Colletor ResultsColletor. Since we decided (so far) on Collector, I think TopDocsCollector makes sense, because of its TopDocs output.
    * Decoupling score from collect().
    I will post a patch a bit later, as this is expected to be a very large patch. I will split it into 2: (1) code patch (2) test cases (moving to use Collector instead of HitCollector, as well as testing the new topDocs(start, howMany) method.
    There might be even a 3rd patch which handles the setScorer thing in Collector (maybe even a different issue?)
    --
    This message is automatically generated by JIRA.
    -
    You can reply to this email to add a comment to the issue online.


    ---------------------------------------------------------------------
    To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
    For additional commands, e-mail: java-dev-help@lucene.apache.org
  • Shai Erera (JIRA) at Mar 31, 2009 at 3:28 pm
    [ https://issues.apache.org/jira/browse/LUCENE-1575?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12694151#action_12694151 ]

    Shai Erera commented on LUCENE-1575:
    ------------------------------------

    When I was about to make the changes to FieldComparator (add setScorer and calling scorer.score() when necessary) I noticed that scorer.score() declares it throws IOException, while the FieldComparatro methods don't. So two ways to handle it:
    # catch the IOException and ignore it, assuming a 0 score. This is if we think no Scorer will actually throw an IOException.
    # Change FieldComparator APIs to declare throwing IOE, which will give us flexibility in the future. Since the Lucene code throws IOE in many places, I don't think it's a problem.

    I'm in favor of (2) (I had to add IOE to Collector.collect() for that reason).
    Refactoring Lucene collectors (HitCollector and extensions)
    -----------------------------------------------------------

    Key: LUCENE-1575
    URL: https://issues.apache.org/jira/browse/LUCENE-1575
    Project: Lucene - Java
    Issue Type: Improvement
    Components: Search
    Reporter: Shai Erera
    Fix For: 2.9


    This issue is a result of a recent discussion we've had on the mailing list. You can read the thread [here|http://www.nabble.com/Is-TopDocCollector%27s-collect()-implementation-correct--td22557419.html].
    We have agreed to do the following refactoring:
    * Rename MultiReaderHitCollector to Collector, with the purpose that it will be the base class for all Collector implementations.
    * Deprecate HitCollector in favor of the new Collector.
    * Introduce new methods in IndexSearcher that accept Collector, and deprecate those that accept HitCollector.
    ** Create a final class HitCollectorWrapper, and use it in the deprecated methods in IndexSearcher, wrapping the given HitCollector.
    ** HitCollectorWrapper will be marked deprecated, so we can remove it in 3.0, when we remove HitCollector.
    ** It will remove any instanceof checks that currently exist in IndexSearcher code.
    * Create a new (abstract) TopDocsCollector, which will:
    ** Leave collect and setNextReader unimplemented.
    ** Introduce protected members PriorityQueue and totalHits.
    ** Introduce a single protected constructor which accepts a PriorityQueue.
    ** Implement topDocs() and getTotalHits() using the PQ and totalHits members. These can be used as-are by extending classes, as well as be overridden.
    ** Introduce a new topDocs(start, howMany) method which will be used a convenience method when implementing a search application which allows paging through search results. It will also attempt to improve the memory allocation, by allocating a ScoreDoc[] of the requested size only.
    * Change TopScoreDocCollector to extend TopDocsCollector, use the topDocs() and getTotalHits() implementations as they are from TopDocsCollector. The class will also be made final.
    * Change TopFieldCollector to extend TopDocsCollector, and make the class final. Also implement topDocs(start, howMany).
    * Change TopFieldDocCollector (deprecated) to extend TopDocsCollector, instead of TopScoreDocCollector. Implement topDocs(start, howMany)
    * Review other places where HitCollector is used, such as in Scorer, deprecate those places and use Collector instead.
    Additionally, the following proposal was made w.r.t. decoupling score from collect():
    * Change collect to accecpt only a doc Id (unbased).
    * Introduce a setScorer(Scorer) method.
    * If during collect the implementation needs the score, it can call scorer.score().
    If we do this, then we need to review all places in the code where collect(doc, score) is called, and assert whether Scorer can be passed. Also this raises few questions:
    * What if during collect() Scorer is null? (i.e., not set) - is it even possible?
    * I noticed that many (if not all) of the collect() implementations discard the document if its score is not greater than 0. Doesn't it mean that score is needed in collect() always?
    Open issues:
    * The name for Collector
    * TopDocsCollector was mentioned on the thread as TopResultsCollector, but that was when we thought to call Colletor ResultsColletor. Since we decided (so far) on Collector, I think TopDocsCollector makes sense, because of its TopDocs output.
    * Decoupling score from collect().
    I will post a patch a bit later, as this is expected to be a very large patch. I will split it into 2: (1) code patch (2) test cases (moving to use Collector instead of HitCollector, as well as testing the new topDocs(start, howMany) method.
    There might be even a 3rd patch which handles the setScorer thing in Collector (maybe even a different issue?)
    --
    This message is automatically generated by JIRA.
    -
    You can reply to this email to add a comment to the issue online.


    ---------------------------------------------------------------------
    To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
    For additional commands, e-mail: java-dev-help@lucene.apache.org
  • Michael McCandless (JIRA) at Mar 31, 2009 at 3:32 pm
    [ https://issues.apache.org/jira/browse/LUCENE-1575?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12694155#action_12694155 ]

    Michael McCandless commented on LUCENE-1575:
    --------------------------------------------

    I like 2 as well. I find reserving that freedom to be very helpful, while preventing it to be a real hassle later on...
    Refactoring Lucene collectors (HitCollector and extensions)
    -----------------------------------------------------------

    Key: LUCENE-1575
    URL: https://issues.apache.org/jira/browse/LUCENE-1575
    Project: Lucene - Java
    Issue Type: Improvement
    Components: Search
    Reporter: Shai Erera
    Fix For: 2.9


    This issue is a result of a recent discussion we've had on the mailing list. You can read the thread [here|http://www.nabble.com/Is-TopDocCollector%27s-collect()-implementation-correct--td22557419.html].
    We have agreed to do the following refactoring:
    * Rename MultiReaderHitCollector to Collector, with the purpose that it will be the base class for all Collector implementations.
    * Deprecate HitCollector in favor of the new Collector.
    * Introduce new methods in IndexSearcher that accept Collector, and deprecate those that accept HitCollector.
    ** Create a final class HitCollectorWrapper, and use it in the deprecated methods in IndexSearcher, wrapping the given HitCollector.
    ** HitCollectorWrapper will be marked deprecated, so we can remove it in 3.0, when we remove HitCollector.
    ** It will remove any instanceof checks that currently exist in IndexSearcher code.
    * Create a new (abstract) TopDocsCollector, which will:
    ** Leave collect and setNextReader unimplemented.
    ** Introduce protected members PriorityQueue and totalHits.
    ** Introduce a single protected constructor which accepts a PriorityQueue.
    ** Implement topDocs() and getTotalHits() using the PQ and totalHits members. These can be used as-are by extending classes, as well as be overridden.
    ** Introduce a new topDocs(start, howMany) method which will be used a convenience method when implementing a search application which allows paging through search results. It will also attempt to improve the memory allocation, by allocating a ScoreDoc[] of the requested size only.
    * Change TopScoreDocCollector to extend TopDocsCollector, use the topDocs() and getTotalHits() implementations as they are from TopDocsCollector. The class will also be made final.
    * Change TopFieldCollector to extend TopDocsCollector, and make the class final. Also implement topDocs(start, howMany).
    * Change TopFieldDocCollector (deprecated) to extend TopDocsCollector, instead of TopScoreDocCollector. Implement topDocs(start, howMany)
    * Review other places where HitCollector is used, such as in Scorer, deprecate those places and use Collector instead.
    Additionally, the following proposal was made w.r.t. decoupling score from collect():
    * Change collect to accecpt only a doc Id (unbased).
    * Introduce a setScorer(Scorer) method.
    * If during collect the implementation needs the score, it can call scorer.score().
    If we do this, then we need to review all places in the code where collect(doc, score) is called, and assert whether Scorer can be passed. Also this raises few questions:
    * What if during collect() Scorer is null? (i.e., not set) - is it even possible?
    * I noticed that many (if not all) of the collect() implementations discard the document if its score is not greater than 0. Doesn't it mean that score is needed in collect() always?
    Open issues:
    * The name for Collector
    * TopDocsCollector was mentioned on the thread as TopResultsCollector, but that was when we thought to call Colletor ResultsColletor. Since we decided (so far) on Collector, I think TopDocsCollector makes sense, because of its TopDocs output.
    * Decoupling score from collect().
    I will post a patch a bit later, as this is expected to be a very large patch. I will split it into 2: (1) code patch (2) test cases (moving to use Collector instead of HitCollector, as well as testing the new topDocs(start, howMany) method.
    There might be even a 3rd patch which handles the setScorer thing in Collector (maybe even a different issue?)
    --
    This message is automatically generated by JIRA.
    -
    You can reply to this email to add a comment to the issue online.


    ---------------------------------------------------------------------
    To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
    For additional commands, e-mail: java-dev-help@lucene.apache.org
  • Shai Erera (JIRA) at Apr 1, 2009 at 8:51 am
    [ https://issues.apache.org/jira/browse/LUCENE-1575?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

    Shai Erera updated LUCENE-1575:
    -------------------------------

    Attachment: LUCENE-1575.patch

    Eventually I decided to include just one patch file (instead of code and test) since it was simpler after all. Please be sure to review the following:
    # Collector class and documentation.
    # New TopDocsCollector class.
    # TopFieldCollector refactoring.
    # Methods deprecation.
    # New TestTopDocsCollector as well as test cases in TestSort.
    Refactoring Lucene collectors (HitCollector and extensions)
    -----------------------------------------------------------

    Key: LUCENE-1575
    URL: https://issues.apache.org/jira/browse/LUCENE-1575
    Project: Lucene - Java
    Issue Type: Improvement
    Components: Search
    Reporter: Shai Erera
    Fix For: 2.9

    Attachments: LUCENE-1575.patch


    This issue is a result of a recent discussion we've had on the mailing list. You can read the thread [here|http://www.nabble.com/Is-TopDocCollector%27s-collect()-implementation-correct--td22557419.html].
    We have agreed to do the following refactoring:
    * Rename MultiReaderHitCollector to Collector, with the purpose that it will be the base class for all Collector implementations.
    * Deprecate HitCollector in favor of the new Collector.
    * Introduce new methods in IndexSearcher that accept Collector, and deprecate those that accept HitCollector.
    ** Create a final class HitCollectorWrapper, and use it in the deprecated methods in IndexSearcher, wrapping the given HitCollector.
    ** HitCollectorWrapper will be marked deprecated, so we can remove it in 3.0, when we remove HitCollector.
    ** It will remove any instanceof checks that currently exist in IndexSearcher code.
    * Create a new (abstract) TopDocsCollector, which will:
    ** Leave collect and setNextReader unimplemented.
    ** Introduce protected members PriorityQueue and totalHits.
    ** Introduce a single protected constructor which accepts a PriorityQueue.
    ** Implement topDocs() and getTotalHits() using the PQ and totalHits members. These can be used as-are by extending classes, as well as be overridden.
    ** Introduce a new topDocs(start, howMany) method which will be used a convenience method when implementing a search application which allows paging through search results. It will also attempt to improve the memory allocation, by allocating a ScoreDoc[] of the requested size only.
    * Change TopScoreDocCollector to extend TopDocsCollector, use the topDocs() and getTotalHits() implementations as they are from TopDocsCollector. The class will also be made final.
    * Change TopFieldCollector to extend TopDocsCollector, and make the class final. Also implement topDocs(start, howMany).
    * Change TopFieldDocCollector (deprecated) to extend TopDocsCollector, instead of TopScoreDocCollector. Implement topDocs(start, howMany)
    * Review other places where HitCollector is used, such as in Scorer, deprecate those places and use Collector instead.
    Additionally, the following proposal was made w.r.t. decoupling score from collect():
    * Change collect to accecpt only a doc Id (unbased).
    * Introduce a setScorer(Scorer) method.
    * If during collect the implementation needs the score, it can call scorer.score().
    If we do this, then we need to review all places in the code where collect(doc, score) is called, and assert whether Scorer can be passed. Also this raises few questions:
    * What if during collect() Scorer is null? (i.e., not set) - is it even possible?
    * I noticed that many (if not all) of the collect() implementations discard the document if its score is not greater than 0. Doesn't it mean that score is needed in collect() always?
    Open issues:
    * The name for Collector
    * TopDocsCollector was mentioned on the thread as TopResultsCollector, but that was when we thought to call Colletor ResultsColletor. Since we decided (so far) on Collector, I think TopDocsCollector makes sense, because of its TopDocs output.
    * Decoupling score from collect().
    I will post a patch a bit later, as this is expected to be a very large patch. I will split it into 2: (1) code patch (2) test cases (moving to use Collector instead of HitCollector, as well as testing the new topDocs(start, howMany) method.
    There might be even a 3rd patch which handles the setScorer thing in Collector (maybe even a different issue?)
    --
    This message is automatically generated by JIRA.
    -
    You can reply to this email to add a comment to the issue online.


    ---------------------------------------------------------------------
    To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
    For additional commands, e-mail: java-dev-help@lucene.apache.org
  • Uwe Schindler (JIRA) at Apr 1, 2009 at 9:11 am
    [ https://issues.apache.org/jira/browse/LUCENE-1575?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12694432#action_12694432 ]

    Uwe Schindler commented on LUCENE-1575:
    ---------------------------------------

    I just wonder, why HitCollectorWrapper implements:
    {code}
    public void collect(int doc, float score) {
    collector.collect(doc + base, score);
    }
    {code}
    This is not needed by Collector abstract class and never called.
    Refactoring Lucene collectors (HitCollector and extensions)
    -----------------------------------------------------------

    Key: LUCENE-1575
    URL: https://issues.apache.org/jira/browse/LUCENE-1575
    Project: Lucene - Java
    Issue Type: Improvement
    Components: Search
    Reporter: Shai Erera
    Fix For: 2.9

    Attachments: LUCENE-1575.patch


    This issue is a result of a recent discussion we've had on the mailing list. You can read the thread [here|http://www.nabble.com/Is-TopDocCollector%27s-collect()-implementation-correct--td22557419.html].
    We have agreed to do the following refactoring:
    * Rename MultiReaderHitCollector to Collector, with the purpose that it will be the base class for all Collector implementations.
    * Deprecate HitCollector in favor of the new Collector.
    * Introduce new methods in IndexSearcher that accept Collector, and deprecate those that accept HitCollector.
    ** Create a final class HitCollectorWrapper, and use it in the deprecated methods in IndexSearcher, wrapping the given HitCollector.
    ** HitCollectorWrapper will be marked deprecated, so we can remove it in 3.0, when we remove HitCollector.
    ** It will remove any instanceof checks that currently exist in IndexSearcher code.
    * Create a new (abstract) TopDocsCollector, which will:
    ** Leave collect and setNextReader unimplemented.
    ** Introduce protected members PriorityQueue and totalHits.
    ** Introduce a single protected constructor which accepts a PriorityQueue.
    ** Implement topDocs() and getTotalHits() using the PQ and totalHits members. These can be used as-are by extending classes, as well as be overridden.
    ** Introduce a new topDocs(start, howMany) method which will be used a convenience method when implementing a search application which allows paging through search results. It will also attempt to improve the memory allocation, by allocating a ScoreDoc[] of the requested size only.
    * Change TopScoreDocCollector to extend TopDocsCollector, use the topDocs() and getTotalHits() implementations as they are from TopDocsCollector. The class will also be made final.
    * Change TopFieldCollector to extend TopDocsCollector, and make the class final. Also implement topDocs(start, howMany).
    * Change TopFieldDocCollector (deprecated) to extend TopDocsCollector, instead of TopScoreDocCollector. Implement topDocs(start, howMany)
    * Review other places where HitCollector is used, such as in Scorer, deprecate those places and use Collector instead.
    Additionally, the following proposal was made w.r.t. decoupling score from collect():
    * Change collect to accecpt only a doc Id (unbased).
    * Introduce a setScorer(Scorer) method.
    * If during collect the implementation needs the score, it can call scorer.score().
    If we do this, then we need to review all places in the code where collect(doc, score) is called, and assert whether Scorer can be passed. Also this raises few questions:
    * What if during collect() Scorer is null? (i.e., not set) - is it even possible?
    * I noticed that many (if not all) of the collect() implementations discard the document if its score is not greater than 0. Doesn't it mean that score is needed in collect() always?
    Open issues:
    * The name for Collector
    * TopDocsCollector was mentioned on the thread as TopResultsCollector, but that was when we thought to call Colletor ResultsColletor. Since we decided (so far) on Collector, I think TopDocsCollector makes sense, because of its TopDocs output.
    * Decoupling score from collect().
    I will post a patch a bit later, as this is expected to be a very large patch. I will split it into 2: (1) code patch (2) test cases (moving to use Collector instead of HitCollector, as well as testing the new topDocs(start, howMany) method.
    There might be even a 3rd patch which handles the setScorer thing in Collector (maybe even a different issue?)
    --
    This message is automatically generated by JIRA.
    -
    You can reply to this email to add a comment to the issue online.


    ---------------------------------------------------------------------
    To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
    For additional commands, e-mail: java-dev-help@lucene.apache.org
  • Shai Erera (JIRA) at Apr 1, 2009 at 9:17 am
    [ https://issues.apache.org/jira/browse/LUCENE-1575?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

    Shai Erera updated LUCENE-1575:
    -------------------------------

    Attachment: LUCENE-1575.1.patch

    oops :) leftovers from when it extended MultiReaderHitCollector (now called Collector)
    Refactoring Lucene collectors (HitCollector and extensions)
    -----------------------------------------------------------

    Key: LUCENE-1575
    URL: https://issues.apache.org/jira/browse/LUCENE-1575
    Project: Lucene - Java
    Issue Type: Improvement
    Components: Search
    Reporter: Shai Erera
    Fix For: 2.9

    Attachments: LUCENE-1575.1.patch, LUCENE-1575.patch


    This issue is a result of a recent discussion we've had on the mailing list. You can read the thread [here|http://www.nabble.com/Is-TopDocCollector%27s-collect()-implementation-correct--td22557419.html].
    We have agreed to do the following refactoring:
    * Rename MultiReaderHitCollector to Collector, with the purpose that it will be the base class for all Collector implementations.
    * Deprecate HitCollector in favor of the new Collector.
    * Introduce new methods in IndexSearcher that accept Collector, and deprecate those that accept HitCollector.
    ** Create a final class HitCollectorWrapper, and use it in the deprecated methods in IndexSearcher, wrapping the given HitCollector.
    ** HitCollectorWrapper will be marked deprecated, so we can remove it in 3.0, when we remove HitCollector.
    ** It will remove any instanceof checks that currently exist in IndexSearcher code.
    * Create a new (abstract) TopDocsCollector, which will:
    ** Leave collect and setNextReader unimplemented.
    ** Introduce protected members PriorityQueue and totalHits.
    ** Introduce a single protected constructor which accepts a PriorityQueue.
    ** Implement topDocs() and getTotalHits() using the PQ and totalHits members. These can be used as-are by extending classes, as well as be overridden.
    ** Introduce a new topDocs(start, howMany) method which will be used a convenience method when implementing a search application which allows paging through search results. It will also attempt to improve the memory allocation, by allocating a ScoreDoc[] of the requested size only.
    * Change TopScoreDocCollector to extend TopDocsCollector, use the topDocs() and getTotalHits() implementations as they are from TopDocsCollector. The class will also be made final.
    * Change TopFieldCollector to extend TopDocsCollector, and make the class final. Also implement topDocs(start, howMany).
    * Change TopFieldDocCollector (deprecated) to extend TopDocsCollector, instead of TopScoreDocCollector. Implement topDocs(start, howMany)
    * Review other places where HitCollector is used, such as in Scorer, deprecate those places and use Collector instead.
    Additionally, the following proposal was made w.r.t. decoupling score from collect():
    * Change collect to accecpt only a doc Id (unbased).
    * Introduce a setScorer(Scorer) method.
    * If during collect the implementation needs the score, it can call scorer.score().
    If we do this, then we need to review all places in the code where collect(doc, score) is called, and assert whether Scorer can be passed. Also this raises few questions:
    * What if during collect() Scorer is null? (i.e., not set) - is it even possible?
    * I noticed that many (if not all) of the collect() implementations discard the document if its score is not greater than 0. Doesn't it mean that score is needed in collect() always?
    Open issues:
    * The name for Collector
    * TopDocsCollector was mentioned on the thread as TopResultsCollector, but that was when we thought to call Colletor ResultsColletor. Since we decided (so far) on Collector, I think TopDocsCollector makes sense, because of its TopDocs output.
    * Decoupling score from collect().
    I will post a patch a bit later, as this is expected to be a very large patch. I will split it into 2: (1) code patch (2) test cases (moving to use Collector instead of HitCollector, as well as testing the new topDocs(start, howMany) method.
    There might be even a 3rd patch which handles the setScorer thing in Collector (maybe even a different issue?)
    --
    This message is automatically generated by JIRA.
    -
    You can reply to this email to add a comment to the issue online.


    ---------------------------------------------------------------------
    To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
    For additional commands, e-mail: java-dev-help@lucene.apache.org
  • Michael McCandless (JIRA) at Apr 1, 2009 at 9:17 am
    [ https://issues.apache.org/jira/browse/LUCENE-1575?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12694439#action_12694439 ]

    Michael McCandless commented on LUCENE-1575:
    --------------------------------------------

    Shai, it looks like you "svn copy"'d MultiReaderHitCollector.java --> Collector.java? It causes "patch" to be confused when applying the patch. The simple workaround is to pre-copy that file yourself, manually, before appying the patch.
    Refactoring Lucene collectors (HitCollector and extensions)
    -----------------------------------------------------------

    Key: LUCENE-1575
    URL: https://issues.apache.org/jira/browse/LUCENE-1575
    Project: Lucene - Java
    Issue Type: Improvement
    Components: Search
    Reporter: Shai Erera
    Fix For: 2.9

    Attachments: LUCENE-1575.1.patch, LUCENE-1575.patch


    This issue is a result of a recent discussion we've had on the mailing list. You can read the thread [here|http://www.nabble.com/Is-TopDocCollector%27s-collect()-implementation-correct--td22557419.html].
    We have agreed to do the following refactoring:
    * Rename MultiReaderHitCollector to Collector, with the purpose that it will be the base class for all Collector implementations.
    * Deprecate HitCollector in favor of the new Collector.
    * Introduce new methods in IndexSearcher that accept Collector, and deprecate those that accept HitCollector.
    ** Create a final class HitCollectorWrapper, and use it in the deprecated methods in IndexSearcher, wrapping the given HitCollector.
    ** HitCollectorWrapper will be marked deprecated, so we can remove it in 3.0, when we remove HitCollector.
    ** It will remove any instanceof checks that currently exist in IndexSearcher code.
    * Create a new (abstract) TopDocsCollector, which will:
    ** Leave collect and setNextReader unimplemented.
    ** Introduce protected members PriorityQueue and totalHits.
    ** Introduce a single protected constructor which accepts a PriorityQueue.
    ** Implement topDocs() and getTotalHits() using the PQ and totalHits members. These can be used as-are by extending classes, as well as be overridden.
    ** Introduce a new topDocs(start, howMany) method which will be used a convenience method when implementing a search application which allows paging through search results. It will also attempt to improve the memory allocation, by allocating a ScoreDoc[] of the requested size only.
    * Change TopScoreDocCollector to extend TopDocsCollector, use the topDocs() and getTotalHits() implementations as they are from TopDocsCollector. The class will also be made final.
    * Change TopFieldCollector to extend TopDocsCollector, and make the class final. Also implement topDocs(start, howMany).
    * Change TopFieldDocCollector (deprecated) to extend TopDocsCollector, instead of TopScoreDocCollector. Implement topDocs(start, howMany)
    * Review other places where HitCollector is used, such as in Scorer, deprecate those places and use Collector instead.
    Additionally, the following proposal was made w.r.t. decoupling score from collect():
    * Change collect to accecpt only a doc Id (unbased).
    * Introduce a setScorer(Scorer) method.
    * If during collect the implementation needs the score, it can call scorer.score().
    If we do this, then we need to review all places in the code where collect(doc, score) is called, and assert whether Scorer can be passed. Also this raises few questions:
    * What if during collect() Scorer is null? (i.e., not set) - is it even possible?
    * I noticed that many (if not all) of the collect() implementations discard the document if its score is not greater than 0. Doesn't it mean that score is needed in collect() always?
    Open issues:
    * The name for Collector
    * TopDocsCollector was mentioned on the thread as TopResultsCollector, but that was when we thought to call Colletor ResultsColletor. Since we decided (so far) on Collector, I think TopDocsCollector makes sense, because of its TopDocs output.
    * Decoupling score from collect().
    I will post a patch a bit later, as this is expected to be a very large patch. I will split it into 2: (1) code patch (2) test cases (moving to use Collector instead of HitCollector, as well as testing the new topDocs(start, howMany) method.
    There might be even a 3rd patch which handles the setScorer thing in Collector (maybe even a different issue?)
    --
    This message is automatically generated by JIRA.
    -
    You can reply to this email to add a comment to the issue online.


    ---------------------------------------------------------------------
    To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
    For additional commands, e-mail: java-dev-help@lucene.apache.org
  • Uwe Schindler (JIRA) at Apr 1, 2009 at 9:25 am
    [ https://issues.apache.org/jira/browse/LUCENE-1575?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12694442#action_12694442 ]

    Uwe Schindler commented on LUCENE-1575:
    ---------------------------------------

    bq. oops leftovers from when it extended MultiReaderHitCollector (now called Collector)
    This is why we really should move to Java 1.5 soon and its @Override annotation...
    Refactoring Lucene collectors (HitCollector and extensions)
    -----------------------------------------------------------

    Key: LUCENE-1575
    URL: https://issues.apache.org/jira/browse/LUCENE-1575
    Project: Lucene - Java
    Issue Type: Improvement
    Components: Search
    Reporter: Shai Erera
    Fix For: 2.9

    Attachments: LUCENE-1575.1.patch, LUCENE-1575.patch


    This issue is a result of a recent discussion we've had on the mailing list. You can read the thread [here|http://www.nabble.com/Is-TopDocCollector%27s-collect()-implementation-correct--td22557419.html].
    We have agreed to do the following refactoring:
    * Rename MultiReaderHitCollector to Collector, with the purpose that it will be the base class for all Collector implementations.
    * Deprecate HitCollector in favor of the new Collector.
    * Introduce new methods in IndexSearcher that accept Collector, and deprecate those that accept HitCollector.
    ** Create a final class HitCollectorWrapper, and use it in the deprecated methods in IndexSearcher, wrapping the given HitCollector.
    ** HitCollectorWrapper will be marked deprecated, so we can remove it in 3.0, when we remove HitCollector.
    ** It will remove any instanceof checks that currently exist in IndexSearcher code.
    * Create a new (abstract) TopDocsCollector, which will:
    ** Leave collect and setNextReader unimplemented.
    ** Introduce protected members PriorityQueue and totalHits.
    ** Introduce a single protected constructor which accepts a PriorityQueue.
    ** Implement topDocs() and getTotalHits() using the PQ and totalHits members. These can be used as-are by extending classes, as well as be overridden.
    ** Introduce a new topDocs(start, howMany) method which will be used a convenience method when implementing a search application which allows paging through search results. It will also attempt to improve the memory allocation, by allocating a ScoreDoc[] of the requested size only.
    * Change TopScoreDocCollector to extend TopDocsCollector, use the topDocs() and getTotalHits() implementations as they are from TopDocsCollector. The class will also be made final.
    * Change TopFieldCollector to extend TopDocsCollector, and make the class final. Also implement topDocs(start, howMany).
    * Change TopFieldDocCollector (deprecated) to extend TopDocsCollector, instead of TopScoreDocCollector. Implement topDocs(start, howMany)
    * Review other places where HitCollector is used, such as in Scorer, deprecate those places and use Collector instead.
    Additionally, the following proposal was made w.r.t. decoupling score from collect():
    * Change collect to accecpt only a doc Id (unbased).
    * Introduce a setScorer(Scorer) method.
    * If during collect the implementation needs the score, it can call scorer.score().
    If we do this, then we need to review all places in the code where collect(doc, score) is called, and assert whether Scorer can be passed. Also this raises few questions:
    * What if during collect() Scorer is null? (i.e., not set) - is it even possible?
    * I noticed that many (if not all) of the collect() implementations discard the document if its score is not greater than 0. Doesn't it mean that score is needed in collect() always?
    Open issues:
    * The name for Collector
    * TopDocsCollector was mentioned on the thread as TopResultsCollector, but that was when we thought to call Colletor ResultsColletor. Since we decided (so far) on Collector, I think TopDocsCollector makes sense, because of its TopDocs output.
    * Decoupling score from collect().
    I will post a patch a bit later, as this is expected to be a very large patch. I will split it into 2: (1) code patch (2) test cases (moving to use Collector instead of HitCollector, as well as testing the new topDocs(start, howMany) method.
    There might be even a 3rd patch which handles the setScorer thing in Collector (maybe even a different issue?)
    --
    This message is automatically generated by JIRA.
    -
    You can reply to this email to add a comment to the issue online.


    ---------------------------------------------------------------------
    To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
    For additional commands, e-mail: java-dev-help@lucene.apache.org
  • Shai Erera (JIRA) at Apr 1, 2009 at 9:29 am
    [ https://issues.apache.org/jira/browse/LUCENE-1575?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12694443#action_12694443 ]

    Shai Erera commented on LUCENE-1575:
    ------------------------------------

    I did not do any "svn copy", just used Eclipse refactoring to change the name of the class to Collector. I did not understand though from your comment if I should do it differently and post another patch, or is that a hint to how someone can still apply the patch?
    Refactoring Lucene collectors (HitCollector and extensions)
    -----------------------------------------------------------

    Key: LUCENE-1575
    URL: https://issues.apache.org/jira/browse/LUCENE-1575
    Project: Lucene - Java
    Issue Type: Improvement
    Components: Search
    Reporter: Shai Erera
    Fix For: 2.9

    Attachments: LUCENE-1575.1.patch, LUCENE-1575.patch


    This issue is a result of a recent discussion we've had on the mailing list. You can read the thread [here|http://www.nabble.com/Is-TopDocCollector%27s-collect()-implementation-correct--td22557419.html].
    We have agreed to do the following refactoring:
    * Rename MultiReaderHitCollector to Collector, with the purpose that it will be the base class for all Collector implementations.
    * Deprecate HitCollector in favor of the new Collector.
    * Introduce new methods in IndexSearcher that accept Collector, and deprecate those that accept HitCollector.
    ** Create a final class HitCollectorWrapper, and use it in the deprecated methods in IndexSearcher, wrapping the given HitCollector.
    ** HitCollectorWrapper will be marked deprecated, so we can remove it in 3.0, when we remove HitCollector.
    ** It will remove any instanceof checks that currently exist in IndexSearcher code.
    * Create a new (abstract) TopDocsCollector, which will:
    ** Leave collect and setNextReader unimplemented.
    ** Introduce protected members PriorityQueue and totalHits.
    ** Introduce a single protected constructor which accepts a PriorityQueue.
    ** Implement topDocs() and getTotalHits() using the PQ and totalHits members. These can be used as-are by extending classes, as well as be overridden.
    ** Introduce a new topDocs(start, howMany) method which will be used a convenience method when implementing a search application which allows paging through search results. It will also attempt to improve the memory allocation, by allocating a ScoreDoc[] of the requested size only.
    * Change TopScoreDocCollector to extend TopDocsCollector, use the topDocs() and getTotalHits() implementations as they are from TopDocsCollector. The class will also be made final.
    * Change TopFieldCollector to extend TopDocsCollector, and make the class final. Also implement topDocs(start, howMany).
    * Change TopFieldDocCollector (deprecated) to extend TopDocsCollector, instead of TopScoreDocCollector. Implement topDocs(start, howMany)
    * Review other places where HitCollector is used, such as in Scorer, deprecate those places and use Collector instead.
    Additionally, the following proposal was made w.r.t. decoupling score from collect():
    * Change collect to accecpt only a doc Id (unbased).
    * Introduce a setScorer(Scorer) method.
    * If during collect the implementation needs the score, it can call scorer.score().
    If we do this, then we need to review all places in the code where collect(doc, score) is called, and assert whether Scorer can be passed. Also this raises few questions:
    * What if during collect() Scorer is null? (i.e., not set) - is it even possible?
    * I noticed that many (if not all) of the collect() implementations discard the document if its score is not greater than 0. Doesn't it mean that score is needed in collect() always?
    Open issues:
    * The name for Collector
    * TopDocsCollector was mentioned on the thread as TopResultsCollector, but that was when we thought to call Colletor ResultsColletor. Since we decided (so far) on Collector, I think TopDocsCollector makes sense, because of its TopDocs output.
    * Decoupling score from collect().
    I will post a patch a bit later, as this is expected to be a very large patch. I will split it into 2: (1) code patch (2) test cases (moving to use Collector instead of HitCollector, as well as testing the new topDocs(start, howMany) method.
    There might be even a 3rd patch which handles the setScorer thing in Collector (maybe even a different issue?)
    --
    This message is automatically generated by JIRA.
    -
    You can reply to this email to add a comment to the issue online.


    ---------------------------------------------------------------------
    To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
    For additional commands, e-mail: java-dev-help@lucene.apache.org
  • DM Smith at Apr 1, 2009 at 12:22 pm

    On Apr 1, 2009, at 5:29 AM, Shai Erera (JIRA) wrote:
    [ https://issues.apache.org/jira/browse/LUCENE-1575?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12694443
    #action_12694443 ]

    Shai Erera commented on LUCENE-1575:
    ------------------------------------

    I did not do any "svn copy", just used Eclipse refactoring to change
    the name of the class to Collector. I did not understand though from
    your comment if I should do it differently and post another patch,
    or is that a hint to how someone can still apply the patch?
    Assuming you have Subclipse installed into Eclipse:

    Subclipse will do an svn rename (aka move) when refactoring names. It
    does an add of the new name and a delete of the old, but retains
    history of the file. This is equal to svn copy followed by svn delete.

    If you create a copy of a file inside of Eclipse, Subclipse will do an
    svn copy.

    There is a Subclipse property DeferFileDelete that when set (with a
    value of "true") on a folder will change the delete behavior of all
    files below it. I set it on the root of my Eclipse projects, because I
    don't like Subclipse's delete behavior.

    I don't understand how it could mess up patching.

    I think what was suggested was to go to the file system and do an OS
    copy of the file. Or use Eclipse without Subclipse.

    -- DM



    ---------------------------------------------------------------------
    To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
    For additional commands, e-mail: java-dev-help@lucene.apache.org
  • Uwe Schindler (JIRA) at Apr 1, 2009 at 9:35 am
    [ https://issues.apache.org/jira/browse/LUCENE-1575?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12694445#action_12694445 ]

    Uwe Schindler commented on LUCENE-1575:
    ---------------------------------------

    JavaDoc errors:
    - The Collector javadoc example still contains the score in its collect() method.
    - The javadocs of ParallelMultiSearcher's new Collector's {code}public void search(Weight weight, Filter filter, final Collector collector){code} has still HitCollector in its JavaDocs.

    This is what I found out when reading the new generated Javadocs.
    Refactoring Lucene collectors (HitCollector and extensions)
    -----------------------------------------------------------

    Key: LUCENE-1575
    URL: https://issues.apache.org/jira/browse/LUCENE-1575
    Project: Lucene - Java
    Issue Type: Improvement
    Components: Search
    Reporter: Shai Erera
    Fix For: 2.9

    Attachments: LUCENE-1575.1.patch, LUCENE-1575.patch


    This issue is a result of a recent discussion we've had on the mailing list. You can read the thread [here|http://www.nabble.com/Is-TopDocCollector%27s-collect()-implementation-correct--td22557419.html].
    We have agreed to do the following refactoring:
    * Rename MultiReaderHitCollector to Collector, with the purpose that it will be the base class for all Collector implementations.
    * Deprecate HitCollector in favor of the new Collector.
    * Introduce new methods in IndexSearcher that accept Collector, and deprecate those that accept HitCollector.
    ** Create a final class HitCollectorWrapper, and use it in the deprecated methods in IndexSearcher, wrapping the given HitCollector.
    ** HitCollectorWrapper will be marked deprecated, so we can remove it in 3.0, when we remove HitCollector.
    ** It will remove any instanceof checks that currently exist in IndexSearcher code.
    * Create a new (abstract) TopDocsCollector, which will:
    ** Leave collect and setNextReader unimplemented.
    ** Introduce protected members PriorityQueue and totalHits.
    ** Introduce a single protected constructor which accepts a PriorityQueue.
    ** Implement topDocs() and getTotalHits() using the PQ and totalHits members. These can be used as-are by extending classes, as well as be overridden.
    ** Introduce a new topDocs(start, howMany) method which will be used a convenience method when implementing a search application which allows paging through search results. It will also attempt to improve the memory allocation, by allocating a ScoreDoc[] of the requested size only.
    * Change TopScoreDocCollector to extend TopDocsCollector, use the topDocs() and getTotalHits() implementations as they are from TopDocsCollector. The class will also be made final.
    * Change TopFieldCollector to extend TopDocsCollector, and make the class final. Also implement topDocs(start, howMany).
    * Change TopFieldDocCollector (deprecated) to extend TopDocsCollector, instead of TopScoreDocCollector. Implement topDocs(start, howMany)
    * Review other places where HitCollector is used, such as in Scorer, deprecate those places and use Collector instead.
    Additionally, the following proposal was made w.r.t. decoupling score from collect():
    * Change collect to accecpt only a doc Id (unbased).
    * Introduce a setScorer(Scorer) method.
    * If during collect the implementation needs the score, it can call scorer.score().
    If we do this, then we need to review all places in the code where collect(doc, score) is called, and assert whether Scorer can be passed. Also this raises few questions:
    * What if during collect() Scorer is null? (i.e., not set) - is it even possible?
    * I noticed that many (if not all) of the collect() implementations discard the document if its score is not greater than 0. Doesn't it mean that score is needed in collect() always?
    Open issues:
    * The name for Collector
    * TopDocsCollector was mentioned on the thread as TopResultsCollector, but that was when we thought to call Colletor ResultsColletor. Since we decided (so far) on Collector, I think TopDocsCollector makes sense, because of its TopDocs output.
    * Decoupling score from collect().
    I will post a patch a bit later, as this is expected to be a very large patch. I will split it into 2: (1) code patch (2) test cases (moving to use Collector instead of HitCollector, as well as testing the new topDocs(start, howMany) method.
    There might be even a 3rd patch which handles the setScorer thing in Collector (maybe even a different issue?)
    --
    This message is automatically generated by JIRA.
    -
    You can reply to this email to add a comment to the issue online.


    ---------------------------------------------------------------------
    To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
    For additional commands, e-mail: java-dev-help@lucene.apache.org
  • Shai Erera (JIRA) at Apr 1, 2009 at 9:47 am
    [ https://issues.apache.org/jira/browse/LUCENE-1575?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

    Shai Erera updated LUCENE-1575:
    -------------------------------

    Attachment: LUCENE-1575.2.patch

    Thanks Mike. I ran the javadocs task and found other mentions of MultiReaderHitCollector as well as fixed some more javadocs. BTW, the javadoc Ant task outputs many errors on missing files/names, but that something for another issue.
    Refactoring Lucene collectors (HitCollector and extensions)
    -----------------------------------------------------------

    Key: LUCENE-1575
    URL: https://issues.apache.org/jira/browse/LUCENE-1575
    Project: Lucene - Java
    Issue Type: Improvement
    Components: Search
    Reporter: Shai Erera
    Fix For: 2.9

    Attachments: LUCENE-1575.1.patch, LUCENE-1575.2.patch, LUCENE-1575.patch


    This issue is a result of a recent discussion we've had on the mailing list. You can read the thread [here|http://www.nabble.com/Is-TopDocCollector%27s-collect()-implementation-correct--td22557419.html].
    We have agreed to do the following refactoring:
    * Rename MultiReaderHitCollector to Collector, with the purpose that it will be the base class for all Collector implementations.
    * Deprecate HitCollector in favor of the new Collector.
    * Introduce new methods in IndexSearcher that accept Collector, and deprecate those that accept HitCollector.
    ** Create a final class HitCollectorWrapper, and use it in the deprecated methods in IndexSearcher, wrapping the given HitCollector.
    ** HitCollectorWrapper will be marked deprecated, so we can remove it in 3.0, when we remove HitCollector.
    ** It will remove any instanceof checks that currently exist in IndexSearcher code.
    * Create a new (abstract) TopDocsCollector, which will:
    ** Leave collect and setNextReader unimplemented.
    ** Introduce protected members PriorityQueue and totalHits.
    ** Introduce a single protected constructor which accepts a PriorityQueue.
    ** Implement topDocs() and getTotalHits() using the PQ and totalHits members. These can be used as-are by extending classes, as well as be overridden.
    ** Introduce a new topDocs(start, howMany) method which will be used a convenience method when implementing a search application which allows paging through search results. It will also attempt to improve the memory allocation, by allocating a ScoreDoc[] of the requested size only.
    * Change TopScoreDocCollector to extend TopDocsCollector, use the topDocs() and getTotalHits() implementations as they are from TopDocsCollector. The class will also be made final.
    * Change TopFieldCollector to extend TopDocsCollector, and make the class final. Also implement topDocs(start, howMany).
    * Change TopFieldDocCollector (deprecated) to extend TopDocsCollector, instead of TopScoreDocCollector. Implement topDocs(start, howMany)
    * Review other places where HitCollector is used, such as in Scorer, deprecate those places and use Collector instead.
    Additionally, the following proposal was made w.r.t. decoupling score from collect():
    * Change collect to accecpt only a doc Id (unbased).
    * Introduce a setScorer(Scorer) method.
    * If during collect the implementation needs the score, it can call scorer.score().
    If we do this, then we need to review all places in the code where collect(doc, score) is called, and assert whether Scorer can be passed. Also this raises few questions:
    * What if during collect() Scorer is null? (i.e., not set) - is it even possible?
    * I noticed that many (if not all) of the collect() implementations discard the document if its score is not greater than 0. Doesn't it mean that score is needed in collect() always?
    Open issues:
    * The name for Collector
    * TopDocsCollector was mentioned on the thread as TopResultsCollector, but that was when we thought to call Colletor ResultsColletor. Since we decided (so far) on Collector, I think TopDocsCollector makes sense, because of its TopDocs output.
    * Decoupling score from collect().
    I will post a patch a bit later, as this is expected to be a very large patch. I will split it into 2: (1) code patch (2) test cases (moving to use Collector instead of HitCollector, as well as testing the new topDocs(start, howMany) method.
    There might be even a 3rd patch which handles the setScorer thing in Collector (maybe even a different issue?)
    --
    This message is automatically generated by JIRA.
    -
    You can reply to this email to add a comment to the issue online.


    ---------------------------------------------------------------------
    To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
    For additional commands, e-mail: java-dev-help@lucene.apache.org
  • Andrzej Bialecki (JIRA) at Apr 1, 2009 at 9:57 am
    [ https://issues.apache.org/jira/browse/LUCENE-1575?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12694452#action_12694452 ]

    Andrzej Bialecki commented on LUCENE-1575:
    -------------------------------------------

    I'm late to this discussion, so I may have missed something. Is there any provision in the new API for the early termination of hit collection? Currently the (time | count)-limited collectors that are used in Nutch and Solr have to throw RuntimeException to break the loop. It would be much more elegant if the new Collector.collect() had a way to signal the caller that it should stop looping without incurring the cost of throwing an Exception. E.g. by returning a boolean, or setting a flag in the caller.
    Refactoring Lucene collectors (HitCollector and extensions)
    -----------------------------------------------------------

    Key: LUCENE-1575
    URL: https://issues.apache.org/jira/browse/LUCENE-1575
    Project: Lucene - Java
    Issue Type: Improvement
    Components: Search
    Reporter: Shai Erera
    Fix For: 2.9

    Attachments: LUCENE-1575.1.patch, LUCENE-1575.2.patch, LUCENE-1575.patch


    This issue is a result of a recent discussion we've had on the mailing list. You can read the thread [here|http://www.nabble.com/Is-TopDocCollector%27s-collect()-implementation-correct--td22557419.html].
    We have agreed to do the following refactoring:
    * Rename MultiReaderHitCollector to Collector, with the purpose that it will be the base class for all Collector implementations.
    * Deprecate HitCollector in favor of the new Collector.
    * Introduce new methods in IndexSearcher that accept Collector, and deprecate those that accept HitCollector.
    ** Create a final class HitCollectorWrapper, and use it in the deprecated methods in IndexSearcher, wrapping the given HitCollector.
    ** HitCollectorWrapper will be marked deprecated, so we can remove it in 3.0, when we remove HitCollector.
    ** It will remove any instanceof checks that currently exist in IndexSearcher code.
    * Create a new (abstract) TopDocsCollector, which will:
    ** Leave collect and setNextReader unimplemented.
    ** Introduce protected members PriorityQueue and totalHits.
    ** Introduce a single protected constructor which accepts a PriorityQueue.
    ** Implement topDocs() and getTotalHits() using the PQ and totalHits members. These can be used as-are by extending classes, as well as be overridden.
    ** Introduce a new topDocs(start, howMany) method which will be used a convenience method when implementing a search application which allows paging through search results. It will also attempt to improve the memory allocation, by allocating a ScoreDoc[] of the requested size only.
    * Change TopScoreDocCollector to extend TopDocsCollector, use the topDocs() and getTotalHits() implementations as they are from TopDocsCollector. The class will also be made final.
    * Change TopFieldCollector to extend TopDocsCollector, and make the class final. Also implement topDocs(start, howMany).
    * Change TopFieldDocCollector (deprecated) to extend TopDocsCollector, instead of TopScoreDocCollector. Implement topDocs(start, howMany)
    * Review other places where HitCollector is used, such as in Scorer, deprecate those places and use Collector instead.
    Additionally, the following proposal was made w.r.t. decoupling score from collect():
    * Change collect to accecpt only a doc Id (unbased).
    * Introduce a setScorer(Scorer) method.
    * If during collect the implementation needs the score, it can call scorer.score().
    If we do this, then we need to review all places in the code where collect(doc, score) is called, and assert whether Scorer can be passed. Also this raises few questions:
    * What if during collect() Scorer is null? (i.e., not set) - is it even possible?
    * I noticed that many (if not all) of the collect() implementations discard the document if its score is not greater than 0. Doesn't it mean that score is needed in collect() always?
    Open issues:
    * The name for Collector
    * TopDocsCollector was mentioned on the thread as TopResultsCollector, but that was when we thought to call Colletor ResultsColletor. Since we decided (so far) on Collector, I think TopDocsCollector makes sense, because of its TopDocs output.
    * Decoupling score from collect().
    I will post a patch a bit later, as this is expected to be a very large patch. I will split it into 2: (1) code patch (2) test cases (moving to use Collector instead of HitCollector, as well as testing the new topDocs(start, howMany) method.
    There might be even a 3rd patch which handles the setScorer thing in Collector (maybe even a different issue?)
    --
    This message is automatically generated by JIRA.
    -
    You can reply to this email to add a comment to the issue online.


    ---------------------------------------------------------------------
    To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
    For additional commands, e-mail: java-dev-help@lucene.apache.org
  • Michael McCandless (JIRA) at Apr 1, 2009 at 10:01 am
    [ https://issues.apache.org/jira/browse/LUCENE-1575?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12694455#action_12694455 ]

    Michael McCandless commented on LUCENE-1575:
    --------------------------------------------

    bq. Thanks Mike.

    That was Uwe ;)
    Refactoring Lucene collectors (HitCollector and extensions)
    -----------------------------------------------------------

    Key: LUCENE-1575
    URL: https://issues.apache.org/jira/browse/LUCENE-1575
    Project: Lucene - Java
    Issue Type: Improvement
    Components: Search
    Reporter: Shai Erera
    Fix For: 2.9

    Attachments: LUCENE-1575.1.patch, LUCENE-1575.2.patch, LUCENE-1575.patch


    This issue is a result of a recent discussion we've had on the mailing list. You can read the thread [here|http://www.nabble.com/Is-TopDocCollector%27s-collect()-implementation-correct--td22557419.html].
    We have agreed to do the following refactoring:
    * Rename MultiReaderHitCollector to Collector, with the purpose that it will be the base class for all Collector implementations.
    * Deprecate HitCollector in favor of the new Collector.
    * Introduce new methods in IndexSearcher that accept Collector, and deprecate those that accept HitCollector.
    ** Create a final class HitCollectorWrapper, and use it in the deprecated methods in IndexSearcher, wrapping the given HitCollector.
    ** HitCollectorWrapper will be marked deprecated, so we can remove it in 3.0, when we remove HitCollector.
    ** It will remove any instanceof checks that currently exist in IndexSearcher code.
    * Create a new (abstract) TopDocsCollector, which will:
    ** Leave collect and setNextReader unimplemented.
    ** Introduce protected members PriorityQueue and totalHits.
    ** Introduce a single protected constructor which accepts a PriorityQueue.
    ** Implement topDocs() and getTotalHits() using the PQ and totalHits members. These can be used as-are by extending classes, as well as be overridden.
    ** Introduce a new topDocs(start, howMany) method which will be used a convenience method when implementing a search application which allows paging through search results. It will also attempt to improve the memory allocation, by allocating a ScoreDoc[] of the requested size only.
    * Change TopScoreDocCollector to extend TopDocsCollector, use the topDocs() and getTotalHits() implementations as they are from TopDocsCollector. The class will also be made final.
    * Change TopFieldCollector to extend TopDocsCollector, and make the class final. Also implement topDocs(start, howMany).
    * Change TopFieldDocCollector (deprecated) to extend TopDocsCollector, instead of TopScoreDocCollector. Implement topDocs(start, howMany)
    * Review other places where HitCollector is used, such as in Scorer, deprecate those places and use Collector instead.
    Additionally, the following proposal was made w.r.t. decoupling score from collect():
    * Change collect to accecpt only a doc Id (unbased).
    * Introduce a setScorer(Scorer) method.
    * If during collect the implementation needs the score, it can call scorer.score().
    If we do this, then we need to review all places in the code where collect(doc, score) is called, and assert whether Scorer can be passed. Also this raises few questions:
    * What if during collect() Scorer is null? (i.e., not set) - is it even possible?
    * I noticed that many (if not all) of the collect() implementations discard the document if its score is not greater than 0. Doesn't it mean that score is needed in collect() always?
    Open issues:
    * The name for Collector
    * TopDocsCollector was mentioned on the thread as TopResultsCollector, but that was when we thought to call Colletor ResultsColletor. Since we decided (so far) on Collector, I think TopDocsCollector makes sense, because of its TopDocs output.
    * Decoupling score from collect().
    I will post a patch a bit later, as this is expected to be a very large patch. I will split it into 2: (1) code patch (2) test cases (moving to use Collector instead of HitCollector, as well as testing the new topDocs(start, howMany) method.
    There might be even a 3rd patch which handles the setScorer thing in Collector (maybe even a different issue?)
    --
    This message is automatically generated by JIRA.
    -
    You can reply to this email to add a comment to the issue online.


    ---------------------------------------------------------------------
    To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
    For additional commands, e-mail: java-dev-help@lucene.apache.org
  • Michael McCandless (JIRA) at Apr 1, 2009 at 10:03 am
    [ https://issues.apache.org/jira/browse/LUCENE-1575?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12694456#action_12694456 ]

    Michael McCandless commented on LUCENE-1575:
    --------------------------------------------

    Looks great! Thanks Shai.

    My biggest question/issue is how TopDocsCollector now (still?)
    requires that the PQ you give it is sorting primarily by score (eg
    getMaxScore() assumes the max is in the PQ; topDocs() uses
    results[0]'s score when start == 0). We've sort of come full circle
    (a "hidden assumption" that you are sorting by score was what started
    the whole thread in the beginning) ;)

    We get away with that with TopFieldCollector because that overrides
    all the score-related processing.

    I'm not sure how to cleanly fix this... maybe TopDocsCollector should
    make maxScore() abstract? Or... it's as if we need the
    scoring/non-scoring bifurcation up higher (moved out of
    TopFieldCollector to above TopDocsCollector)?

    EG say I provide my own PQ that's sorting by date (by loading date
    from some external source, say); I may or may not care for score.

    Maybe we make a ScoringWrapperCollector that grabs score in its own
    collect, then calls collect() on its child?

    A few other small things:

    - In the private final static classes inside TopFieldCollector, you
    can make the members final too (eg reverseMul, comparator), in
    case it helps compiler.

    - Can you add a paragraph @ top of CHANGES stating the pending
    default swap in 3.0 (ie the same note you added to
    IndexSearcher.search). Add a new "Changes in backwards
    compatibility policy" section at the very top (look at how 2.4.0
    release did it). And can you give explicit code fragment showing
    how to get back to the old way (ie show that you must pass in
    "true" for trackDocScores).

    Refactoring Lucene collectors (HitCollector and extensions)
    -----------------------------------------------------------

    Key: LUCENE-1575
    URL: https://issues.apache.org/jira/browse/LUCENE-1575
    Project: Lucene - Java
    Issue Type: Improvement
    Components: Search
    Reporter: Shai Erera
    Fix For: 2.9

    Attachments: LUCENE-1575.1.patch, LUCENE-1575.2.patch, LUCENE-1575.patch


    This issue is a result of a recent discussion we've had on the mailing list. You can read the thread [here|http://www.nabble.com/Is-TopDocCollector%27s-collect()-implementation-correct--td22557419.html].
    We have agreed to do the following refactoring:
    * Rename MultiReaderHitCollector to Collector, with the purpose that it will be the base class for all Collector implementations.
    * Deprecate HitCollector in favor of the new Collector.
    * Introduce new methods in IndexSearcher that accept Collector, and deprecate those that accept HitCollector.
    ** Create a final class HitCollectorWrapper, and use it in the deprecated methods in IndexSearcher, wrapping the given HitCollector.
    ** HitCollectorWrapper will be marked deprecated, so we can remove it in 3.0, when we remove HitCollector.
    ** It will remove any instanceof checks that currently exist in IndexSearcher code.
    * Create a new (abstract) TopDocsCollector, which will:
    ** Leave collect and setNextReader unimplemented.
    ** Introduce protected members PriorityQueue and totalHits.
    ** Introduce a single protected constructor which accepts a PriorityQueue.
    ** Implement topDocs() and getTotalHits() using the PQ and totalHits members. These can be used as-are by extending classes, as well as be overridden.
    ** Introduce a new topDocs(start, howMany) method which will be used a convenience method when implementing a search application which allows paging through search results. It will also attempt to improve the memory allocation, by allocating a ScoreDoc[] of the requested size only.
    * Change TopScoreDocCollector to extend TopDocsCollector, use the topDocs() and getTotalHits() implementations as they are from TopDocsCollector. The class will also be made final.
    * Change TopFieldCollector to extend TopDocsCollector, and make the class final. Also implement topDocs(start, howMany).
    * Change TopFieldDocCollector (deprecated) to extend TopDocsCollector, instead of TopScoreDocCollector. Implement topDocs(start, howMany)
    * Review other places where HitCollector is used, such as in Scorer, deprecate those places and use Collector instead.
    Additionally, the following proposal was made w.r.t. decoupling score from collect():
    * Change collect to accecpt only a doc Id (unbased).
    * Introduce a setScorer(Scorer) method.
    * If during collect the implementation needs the score, it can call scorer.score().
    If we do this, then we need to review all places in the code where collect(doc, score) is called, and assert whether Scorer can be passed. Also this raises few questions:
    * What if during collect() Scorer is null? (i.e., not set) - is it even possible?
    * I noticed that many (if not all) of the collect() implementations discard the document if its score is not greater than 0. Doesn't it mean that score is needed in collect() always?
    Open issues:
    * The name for Collector
    * TopDocsCollector was mentioned on the thread as TopResultsCollector, but that was when we thought to call Colletor ResultsColletor. Since we decided (so far) on Collector, I think TopDocsCollector makes sense, because of its TopDocs output.
    * Decoupling score from collect().
    I will post a patch a bit later, as this is expected to be a very large patch. I will split it into 2: (1) code patch (2) test cases (moving to use Collector instead of HitCollector, as well as testing the new topDocs(start, howMany) method.
    There might be even a 3rd patch which handles the setScorer thing in Collector (maybe even a different issue?)
    --
    This message is automatically generated by JIRA.
    -
    You can reply to this email to add a comment to the issue online.


    ---------------------------------------------------------------------
    To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
    For additional commands, e-mail: java-dev-help@lucene.apache.org
  • Michael McCandless (JIRA) at Apr 1, 2009 at 10:13 am
    [ https://issues.apache.org/jira/browse/LUCENE-1575?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12694459#action_12694459 ]

    Michael McCandless commented on LUCENE-1575:
    --------------------------------------------

    bq. It would be much more elegant if the new Collector.collect() had a way to signal the caller that it should stop looping without incurring the cost of throwing an Exception. E.g. by returning a boolean, or setting a flag in the caller.

    I agree: we should work into this new collection API a way to stop early.

    But I'm nervous about the cost of checking a returned boolean on every collect call vs the cost of throwing/catching an exception. Adding the boolean check slows down every single collect() call (by just a bit, but bits really count here), even those that never use stop early (the majority of apps today). Throwing an exception adds a clear cost when you actually throw & catch it, but presumably that cost is proportionally tiny because you only throw it on searches that have been running for a long time, anyway.

    We could upgrade the exception to a checked exception; then we'd need to add "throws XXX" to the search methods in 3.0 (perhaps wrapping as RuntimeException until 3.0). But then I also wonder if the checked exception logic would add instructions in the collect() path.
    Refactoring Lucene collectors (HitCollector and extensions)
    -----------------------------------------------------------

    Key: LUCENE-1575
    URL: https://issues.apache.org/jira/browse/LUCENE-1575
    Project: Lucene - Java
    Issue Type: Improvement
    Components: Search
    Reporter: Shai Erera
    Fix For: 2.9

    Attachments: LUCENE-1575.1.patch, LUCENE-1575.2.patch, LUCENE-1575.patch


    This issue is a result of a recent discussion we've had on the mailing list. You can read the thread [here|http://www.nabble.com/Is-TopDocCollector%27s-collect()-implementation-correct--td22557419.html].
    We have agreed to do the following refactoring:
    * Rename MultiReaderHitCollector to Collector, with the purpose that it will be the base class for all Collector implementations.
    * Deprecate HitCollector in favor of the new Collector.
    * Introduce new methods in IndexSearcher that accept Collector, and deprecate those that accept HitCollector.
    ** Create a final class HitCollectorWrapper, and use it in the deprecated methods in IndexSearcher, wrapping the given HitCollector.
    ** HitCollectorWrapper will be marked deprecated, so we can remove it in 3.0, when we remove HitCollector.
    ** It will remove any instanceof checks that currently exist in IndexSearcher code.
    * Create a new (abstract) TopDocsCollector, which will:
    ** Leave collect and setNextReader unimplemented.
    ** Introduce protected members PriorityQueue and totalHits.
    ** Introduce a single protected constructor which accepts a PriorityQueue.
    ** Implement topDocs() and getTotalHits() using the PQ and totalHits members. These can be used as-are by extending classes, as well as be overridden.
    ** Introduce a new topDocs(start, howMany) method which will be used a convenience method when implementing a search application which allows paging through search results. It will also attempt to improve the memory allocation, by allocating a ScoreDoc[] of the requested size only.
    * Change TopScoreDocCollector to extend TopDocsCollector, use the topDocs() and getTotalHits() implementations as they are from TopDocsCollector. The class will also be made final.
    * Change TopFieldCollector to extend TopDocsCollector, and make the class final. Also implement topDocs(start, howMany).
    * Change TopFieldDocCollector (deprecated) to extend TopDocsCollector, instead of TopScoreDocCollector. Implement topDocs(start, howMany)
    * Review other places where HitCollector is used, such as in Scorer, deprecate those places and use Collector instead.
    Additionally, the following proposal was made w.r.t. decoupling score from collect():
    * Change collect to accecpt only a doc Id (unbased).
    * Introduce a setScorer(Scorer) method.
    * If during collect the implementation needs the score, it can call scorer.score().
    If we do this, then we need to review all places in the code where collect(doc, score) is called, and assert whether Scorer can be passed. Also this raises few questions:
    * What if during collect() Scorer is null? (i.e., not set) - is it even possible?
    * I noticed that many (if not all) of the collect() implementations discard the document if its score is not greater than 0. Doesn't it mean that score is needed in collect() always?
    Open issues:
    * The name for Collector
    * TopDocsCollector was mentioned on the thread as TopResultsCollector, but that was when we thought to call Colletor ResultsColletor. Since we decided (so far) on Collector, I think TopDocsCollector makes sense, because of its TopDocs output.
    * Decoupling score from collect().
    I will post a patch a bit later, as this is expected to be a very large patch. I will split it into 2: (1) code patch (2) test cases (moving to use Collector instead of HitCollector, as well as testing the new topDocs(start, howMany) method.
    There might be even a 3rd patch which handles the setScorer thing in Collector (maybe even a different issue?)
    --
    This message is automatically generated by JIRA.
    -
    You can reply to this email to add a comment to the issue online.


    ---------------------------------------------------------------------
    To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
    For additional commands, e-mail: java-dev-help@lucene.apache.org
  • Uwe Schindler (JIRA) at Apr 1, 2009 at 11:05 am
    [ https://issues.apache.org/jira/browse/LUCENE-1575?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12694469#action_12694469 ]

    Uwe Schindler commented on LUCENE-1575:
    ---------------------------------------

    I think, a checked Exception to stop collecting would be the best. The "cost" of the exception is very minimal (it is only thrown once in the collector and catched somewhere at top level). So where would be the costly part? Microseconds per search?
    Refactoring Lucene collectors (HitCollector and extensions)
    -----------------------------------------------------------

    Key: LUCENE-1575
    URL: https://issues.apache.org/jira/browse/LUCENE-1575
    Project: Lucene - Java
    Issue Type: Improvement
    Components: Search
    Reporter: Shai Erera
    Fix For: 2.9

    Attachments: LUCENE-1575.1.patch, LUCENE-1575.2.patch, LUCENE-1575.patch


    This issue is a result of a recent discussion we've had on the mailing list. You can read the thread [here|http://www.nabble.com/Is-TopDocCollector%27s-collect()-implementation-correct--td22557419.html].
    We have agreed to do the following refactoring:
    * Rename MultiReaderHitCollector to Collector, with the purpose that it will be the base class for all Collector implementations.
    * Deprecate HitCollector in favor of the new Collector.
    * Introduce new methods in IndexSearcher that accept Collector, and deprecate those that accept HitCollector.
    ** Create a final class HitCollectorWrapper, and use it in the deprecated methods in IndexSearcher, wrapping the given HitCollector.
    ** HitCollectorWrapper will be marked deprecated, so we can remove it in 3.0, when we remove HitCollector.
    ** It will remove any instanceof checks that currently exist in IndexSearcher code.
    * Create a new (abstract) TopDocsCollector, which will:
    ** Leave collect and setNextReader unimplemented.
    ** Introduce protected members PriorityQueue and totalHits.
    ** Introduce a single protected constructor which accepts a PriorityQueue.
    ** Implement topDocs() and getTotalHits() using the PQ and totalHits members. These can be used as-are by extending classes, as well as be overridden.
    ** Introduce a new topDocs(start, howMany) method which will be used a convenience method when implementing a search application which allows paging through search results. It will also attempt to improve the memory allocation, by allocating a ScoreDoc[] of the requested size only.
    * Change TopScoreDocCollector to extend TopDocsCollector, use the topDocs() and getTotalHits() implementations as they are from TopDocsCollector. The class will also be made final.
    * Change TopFieldCollector to extend TopDocsCollector, and make the class final. Also implement topDocs(start, howMany).
    * Change TopFieldDocCollector (deprecated) to extend TopDocsCollector, instead of TopScoreDocCollector. Implement topDocs(start, howMany)
    * Review other places where HitCollector is used, such as in Scorer, deprecate those places and use Collector instead.
    Additionally, the following proposal was made w.r.t. decoupling score from collect():
    * Change collect to accecpt only a doc Id (unbased).
    * Introduce a setScorer(Scorer) method.
    * If during collect the implementation needs the score, it can call scorer.score().
    If we do this, then we need to review all places in the code where collect(doc, score) is called, and assert whether Scorer can be passed. Also this raises few questions:
    * What if during collect() Scorer is null? (i.e., not set) - is it even possible?
    * I noticed that many (if not all) of the collect() implementations discard the document if its score is not greater than 0. Doesn't it mean that score is needed in collect() always?
    Open issues:
    * The name for Collector
    * TopDocsCollector was mentioned on the thread as TopResultsCollector, but that was when we thought to call Colletor ResultsColletor. Since we decided (so far) on Collector, I think TopDocsCollector makes sense, because of its TopDocs output.
    * Decoupling score from collect().
    I will post a patch a bit later, as this is expected to be a very large patch. I will split it into 2: (1) code patch (2) test cases (moving to use Collector instead of HitCollector, as well as testing the new topDocs(start, howMany) method.
    There might be even a 3rd patch which handles the setScorer thing in Collector (maybe even a different issue?)
    --
    This message is automatically generated by JIRA.
    -
    You can reply to this email to add a comment to the issue online.


    ---------------------------------------------------------------------
    To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
    For additional commands, e-mail: java-dev-help@lucene.apache.org
  • Shai Erera (JIRA) at Apr 1, 2009 at 11:25 am
    [ https://issues.apache.org/jira/browse/LUCENE-1575?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12694474#action_12694474 ]

    Shai Erera commented on LUCENE-1575:
    ------------------------------------

    Sorry -- thanks Uwe :)

    On Wed, Apr 1, 2009 at 1:01 PM, Michael McCandless (JIRA)


    Refactoring Lucene collectors (HitCollector and extensions)
    -----------------------------------------------------------

    Key: LUCENE-1575
    URL: https://issues.apache.org/jira/browse/LUCENE-1575
    Project: Lucene - Java
    Issue Type: Improvement
    Components: Search
    Reporter: Shai Erera
    Fix For: 2.9

    Attachments: LUCENE-1575.1.patch, LUCENE-1575.2.patch, LUCENE-1575.patch


    This issue is a result of a recent discussion we've had on the mailing list. You can read the thread [here|http://www.nabble.com/Is-TopDocCollector%27s-collect()-implementation-correct--td22557419.html].
    We have agreed to do the following refactoring:
    * Rename MultiReaderHitCollector to Collector, with the purpose that it will be the base class for all Collector implementations.
    * Deprecate HitCollector in favor of the new Collector.
    * Introduce new methods in IndexSearcher that accept Collector, and deprecate those that accept HitCollector.
    ** Create a final class HitCollectorWrapper, and use it in the deprecated methods in IndexSearcher, wrapping the given HitCollector.
    ** HitCollectorWrapper will be marked deprecated, so we can remove it in 3.0, when we remove HitCollector.
    ** It will remove any instanceof checks that currently exist in IndexSearcher code.
    * Create a new (abstract) TopDocsCollector, which will:
    ** Leave collect and setNextReader unimplemented.
    ** Introduce protected members PriorityQueue and totalHits.
    ** Introduce a single protected constructor which accepts a PriorityQueue.
    ** Implement topDocs() and getTotalHits() using the PQ and totalHits members. These can be used as-are by extending classes, as well as be overridden.
    ** Introduce a new topDocs(start, howMany) method which will be used a convenience method when implementing a search application which allows paging through search results. It will also attempt to improve the memory allocation, by allocating a ScoreDoc[] of the requested size only.
    * Change TopScoreDocCollector to extend TopDocsCollector, use the topDocs() and getTotalHits() implementations as they are from TopDocsCollector. The class will also be made final.
    * Change TopFieldCollector to extend TopDocsCollector, and make the class final. Also implement topDocs(start, howMany).
    * Change TopFieldDocCollector (deprecated) to extend TopDocsCollector, instead of TopScoreDocCollector. Implement topDocs(start, howMany)
    * Review other places where HitCollector is used, such as in Scorer, deprecate those places and use Collector instead.
    Additionally, the following proposal was made w.r.t. decoupling score from collect():
    * Change collect to accecpt only a doc Id (unbased).
    * Introduce a setScorer(Scorer) method.
    * If during collect the implementation needs the score, it can call scorer.score().
    If we do this, then we need to review all places in the code where collect(doc, score) is called, and assert whether Scorer can be passed. Also this raises few questions:
    * What if during collect() Scorer is null? (i.e., not set) - is it even possible?
    * I noticed that many (if not all) of the collect() implementations discard the document if its score is not greater than 0. Doesn't it mean that score is needed in collect() always?
    Open issues:
    * The name for Collector
    * TopDocsCollector was mentioned on the thread as TopResultsCollector, but that was when we thought to call Colletor ResultsColletor. Since we decided (so far) on Collector, I think TopDocsCollector makes sense, because of its TopDocs output.
    * Decoupling score from collect().
    I will post a patch a bit later, as this is expected to be a very large patch. I will split it into 2: (1) code patch (2) test cases (moving to use Collector instead of HitCollector, as well as testing the new topDocs(start, howMany) method.
    There might be even a 3rd patch which handles the setScorer thing in Collector (maybe even a different issue?)
    --
    This message is automatically generated by JIRA.
    -
    You can reply to this email to add a comment to the issue online.


    ---------------------------------------------------------------------
    To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
    For additional commands, e-mail: java-dev-help@lucene.apache.org
  • Shai Erera (JIRA) at Apr 1, 2009 at 11:37 am
    [ https://issues.apache.org/jira/browse/LUCENE-1575?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12694483#action_12694483 ]

    Shai Erera commented on LUCENE-1575:
    ------------------------------------

    bq. I think, a checked Exception to stop collecting would be the best.

    I actually think a RuntimeException is better for the following reasons:
    * The effects of a checked exception added to collect() will not touch IndexSearcher only, but also some Scorer implementations.
    * That only exists in the TimeLimitedCollector, which is not instantiated by Lucene code, but rather by my application code. Therefore, reading its javadocs, I know I should wrap the searcher.search with a try-catch. On the other hand, the rest of the Lucene users will not need to do that since they don't use it.
    * Going forward, someone may come up with another limiting Collector implementation, for example "I want to stop collecting and searching as soon as I hit 10 documents" - what would we do then - add another checked exception?

    My point is - since such Collectors (at least now) are now instantiated by the search application and not by the Lucene code itself, RuntimeException is as good as checked exception, only they don't require any changes to the methods signature.
    Refactoring Lucene collectors (HitCollector and extensions)
    -----------------------------------------------------------

    Key: LUCENE-1575
    URL: https://issues.apache.org/jira/browse/LUCENE-1575
    Project: Lucene - Java
    Issue Type: Improvement
    Components: Search
    Reporter: Shai Erera
    Fix For: 2.9

    Attachments: LUCENE-1575.1.patch, LUCENE-1575.2.patch, LUCENE-1575.patch


    This issue is a result of a recent discussion we've had on the mailing list. You can read the thread [here|http://www.nabble.com/Is-TopDocCollector%27s-collect()-implementation-correct--td22557419.html].
    We have agreed to do the following refactoring:
    * Rename MultiReaderHitCollector to Collector, with the purpose that it will be the base class for all Collector implementations.
    * Deprecate HitCollector in favor of the new Collector.
    * Introduce new methods in IndexSearcher that accept Collector, and deprecate those that accept HitCollector.
    ** Create a final class HitCollectorWrapper, and use it in the deprecated methods in IndexSearcher, wrapping the given HitCollector.
    ** HitCollectorWrapper will be marked deprecated, so we can remove it in 3.0, when we remove HitCollector.
    ** It will remove any instanceof checks that currently exist in IndexSearcher code.
    * Create a new (abstract) TopDocsCollector, which will:
    ** Leave collect and setNextReader unimplemented.
    ** Introduce protected members PriorityQueue and totalHits.
    ** Introduce a single protected constructor which accepts a PriorityQueue.
    ** Implement topDocs() and getTotalHits() using the PQ and totalHits members. These can be used as-are by extending classes, as well as be overridden.
    ** Introduce a new topDocs(start, howMany) method which will be used a convenience method when implementing a search application which allows paging through search results. It will also attempt to improve the memory allocation, by allocating a ScoreDoc[] of the requested size only.
    * Change TopScoreDocCollector to extend TopDocsCollector, use the topDocs() and getTotalHits() implementations as they are from TopDocsCollector. The class will also be made final.
    * Change TopFieldCollector to extend TopDocsCollector, and make the class final. Also implement topDocs(start, howMany).
    * Change TopFieldDocCollector (deprecated) to extend TopDocsCollector, instead of TopScoreDocCollector. Implement topDocs(start, howMany)
    * Review other places where HitCollector is used, such as in Scorer, deprecate those places and use Collector instead.
    Additionally, the following proposal was made w.r.t. decoupling score from collect():
    * Change collect to accecpt only a doc Id (unbased).
    * Introduce a setScorer(Scorer) method.
    * If during collect the implementation needs the score, it can call scorer.score().
    If we do this, then we need to review all places in the code where collect(doc, score) is called, and assert whether Scorer can be passed. Also this raises few questions:
    * What if during collect() Scorer is null? (i.e., not set) - is it even possible?
    * I noticed that many (if not all) of the collect() implementations discard the document if its score is not greater than 0. Doesn't it mean that score is needed in collect() always?
    Open issues:
    * The name for Collector
    * TopDocsCollector was mentioned on the thread as TopResultsCollector, but that was when we thought to call Colletor ResultsColletor. Since we decided (so far) on Collector, I think TopDocsCollector makes sense, because of its TopDocs output.
    * Decoupling score from collect().
    I will post a patch a bit later, as this is expected to be a very large patch. I will split it into 2: (1) code patch (2) test cases (moving to use Collector instead of HitCollector, as well as testing the new topDocs(start, howMany) method.
    There might be even a 3rd patch which handles the setScorer thing in Collector (maybe even a different issue?)
    --
    This message is automatically generated by JIRA.
    -
    You can reply to this email to add a comment to the issue online.


    ---------------------------------------------------------------------
    To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
    For additional commands, e-mail: java-dev-help@lucene.apache.org
  • Shai Erera (JIRA) at Apr 1, 2009 at 12:03 pm
    [ https://issues.apache.org/jira/browse/LUCENE-1575?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12694494#action_12694494 ]

    Shai Erera commented on LUCENE-1575:
    ------------------------------------

    bq. Maybe we make a ScoringWrapperCollector that grabs score in its own collect, then calls collect() on its child?

    How about this:
    # I remove maxScore() completely from TopDocsCollector.
    # I define newTopDocs to return a TopDocs() w/o a maxScore (like we said, defaulting to Float.NaN). Also change the method signature to receive the start as well as ScoreDoc[] that was collected so far.
    # TopScoreDocCollector will override it, fetching maxScore and returning a TopDocs.
    #* TopFieldCollector will do the same, but returning TopFieldDocs.

    bq. Can you add a paragraph @ top of CHANGES stating the pending default swap in 3.0

    Done.

    bq. In the private final static classes inside TopFieldCollector, you can make the members final too

    Done.
    Refactoring Lucene collectors (HitCollector and extensions)
    -----------------------------------------------------------

    Key: LUCENE-1575
    URL: https://issues.apache.org/jira/browse/LUCENE-1575
    Project: Lucene - Java
    Issue Type: Improvement
    Components: Search
    Reporter: Shai Erera
    Fix For: 2.9

    Attachments: LUCENE-1575.1.patch, LUCENE-1575.2.patch, LUCENE-1575.patch


    This issue is a result of a recent discussion we've had on the mailing list. You can read the thread [here|http://www.nabble.com/Is-TopDocCollector%27s-collect()-implementation-correct--td22557419.html].
    We have agreed to do the following refactoring:
    * Rename MultiReaderHitCollector to Collector, with the purpose that it will be the base class for all Collector implementations.
    * Deprecate HitCollector in favor of the new Collector.
    * Introduce new methods in IndexSearcher that accept Collector, and deprecate those that accept HitCollector.
    ** Create a final class HitCollectorWrapper, and use it in the deprecated methods in IndexSearcher, wrapping the given HitCollector.
    ** HitCollectorWrapper will be marked deprecated, so we can remove it in 3.0, when we remove HitCollector.
    ** It will remove any instanceof checks that currently exist in IndexSearcher code.
    * Create a new (abstract) TopDocsCollector, which will:
    ** Leave collect and setNextReader unimplemented.
    ** Introduce protected members PriorityQueue and totalHits.
    ** Introduce a single protected constructor which accepts a PriorityQueue.
    ** Implement topDocs() and getTotalHits() using the PQ and totalHits members. These can be used as-are by extending classes, as well as be overridden.
    ** Introduce a new topDocs(start, howMany) method which will be used a convenience method when implementing a search application which allows paging through search results. It will also attempt to improve the memory allocation, by allocating a ScoreDoc[] of the requested size only.
    * Change TopScoreDocCollector to extend TopDocsCollector, use the topDocs() and getTotalHits() implementations as they are from TopDocsCollector. The class will also be made final.
    * Change TopFieldCollector to extend TopDocsCollector, and make the class final. Also implement topDocs(start, howMany).
    * Change TopFieldDocCollector (deprecated) to extend TopDocsCollector, instead of TopScoreDocCollector. Implement topDocs(start, howMany)
    * Review other places where HitCollector is used, such as in Scorer, deprecate those places and use Collector instead.
    Additionally, the following proposal was made w.r.t. decoupling score from collect():
    * Change collect to accecpt only a doc Id (unbased).
    * Introduce a setScorer(Scorer) method.
    * If during collect the implementation needs the score, it can call scorer.score().
    If we do this, then we need to review all places in the code where collect(doc, score) is called, and assert whether Scorer can be passed. Also this raises few questions:
    * What if during collect() Scorer is null? (i.e., not set) - is it even possible?
    * I noticed that many (if not all) of the collect() implementations discard the document if its score is not greater than 0. Doesn't it mean that score is needed in collect() always?
    Open issues:
    * The name for Collector
    * TopDocsCollector was mentioned on the thread as TopResultsCollector, but that was when we thought to call Colletor ResultsColletor. Since we decided (so far) on Collector, I think TopDocsCollector makes sense, because of its TopDocs output.
    * Decoupling score from collect().
    I will post a patch a bit later, as this is expected to be a very large patch. I will split it into 2: (1) code patch (2) test cases (moving to use Collector instead of HitCollector, as well as testing the new topDocs(start, howMany) method.
    There might be even a 3rd patch which handles the setScorer thing in Collector (maybe even a different issue?)
    --
    This message is automatically generated by JIRA.
    -
    You can reply to this email to add a comment to the issue online.


    ---------------------------------------------------------------------
    To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
    For additional commands, e-mail: java-dev-help@lucene.apache.org
  • Shai Erera (JIRA) at Apr 1, 2009 at 12:07 pm
    [ https://issues.apache.org/jira/browse/LUCENE-1575?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

    Shai Erera updated LUCENE-1575:
    -------------------------------

    Attachment: LUCENE-1575.3.patch

    Includes the latest comments from Mike.
    Refactoring Lucene collectors (HitCollector and extensions)
    -----------------------------------------------------------

    Key: LUCENE-1575
    URL: https://issues.apache.org/jira/browse/LUCENE-1575
    Project: Lucene - Java
    Issue Type: Improvement
    Components: Search
    Reporter: Shai Erera
    Fix For: 2.9

    Attachments: LUCENE-1575.1.patch, LUCENE-1575.2.patch, LUCENE-1575.3.patch, LUCENE-1575.patch


    This issue is a result of a recent discussion we've had on the mailing list. You can read the thread [here|http://www.nabble.com/Is-TopDocCollector%27s-collect()-implementation-correct--td22557419.html].
    We have agreed to do the following refactoring:
    * Rename MultiReaderHitCollector to Collector, with the purpose that it will be the base class for all Collector implementations.
    * Deprecate HitCollector in favor of the new Collector.
    * Introduce new methods in IndexSearcher that accept Collector, and deprecate those that accept HitCollector.
    ** Create a final class HitCollectorWrapper, and use it in the deprecated methods in IndexSearcher, wrapping the given HitCollector.
    ** HitCollectorWrapper will be marked deprecated, so we can remove it in 3.0, when we remove HitCollector.
    ** It will remove any instanceof checks that currently exist in IndexSearcher code.
    * Create a new (abstract) TopDocsCollector, which will:
    ** Leave collect and setNextReader unimplemented.
    ** Introduce protected members PriorityQueue and totalHits.
    ** Introduce a single protected constructor which accepts a PriorityQueue.
    ** Implement topDocs() and getTotalHits() using the PQ and totalHits members. These can be used as-are by extending classes, as well as be overridden.
    ** Introduce a new topDocs(start, howMany) method which will be used a convenience method when implementing a search application which allows paging through search results. It will also attempt to improve the memory allocation, by allocating a ScoreDoc[] of the requested size only.
    * Change TopScoreDocCollector to extend TopDocsCollector, use the topDocs() and getTotalHits() implementations as they are from TopDocsCollector. The class will also be made final.
    * Change TopFieldCollector to extend TopDocsCollector, and make the class final. Also implement topDocs(start, howMany).
    * Change TopFieldDocCollector (deprecated) to extend TopDocsCollector, instead of TopScoreDocCollector. Implement topDocs(start, howMany)
    * Review other places where HitCollector is used, such as in Scorer, deprecate those places and use Collector instead.
    Additionally, the following proposal was made w.r.t. decoupling score from collect():
    * Change collect to accecpt only a doc Id (unbased).
    * Introduce a setScorer(Scorer) method.
    * If during collect the implementation needs the score, it can call scorer.score().
    If we do this, then we need to review all places in the code where collect(doc, score) is called, and assert whether Scorer can be passed. Also this raises few questions:
    * What if during collect() Scorer is null? (i.e., not set) - is it even possible?
    * I noticed that many (if not all) of the collect() implementations discard the document if its score is not greater than 0. Doesn't it mean that score is needed in collect() always?
    Open issues:
    * The name for Collector
    * TopDocsCollector was mentioned on the thread as TopResultsCollector, but that was when we thought to call Colletor ResultsColletor. Since we decided (so far) on Collector, I think TopDocsCollector makes sense, because of its TopDocs output.
    * Decoupling score from collect().
    I will post a patch a bit later, as this is expected to be a very large patch. I will split it into 2: (1) code patch (2) test cases (moving to use Collector instead of HitCollector, as well as testing the new topDocs(start, howMany) method.
    There might be even a 3rd patch which handles the setScorer thing in Collector (maybe even a different issue?)
    --
    This message is automatically generated by JIRA.
    -
    You can reply to this email to add a comment to the issue online.


    ---------------------------------------------------------------------
    To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
    For additional commands, e-mail: java-dev-help@lucene.apache.org
  • Michael McCandless (JIRA) at Apr 1, 2009 at 12:39 pm
    [ https://issues.apache.org/jira/browse/LUCENE-1575?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12694502#action_12694502 ]

    Michael McCandless commented on LUCENE-1575:
    --------------------------------------------

    bq. How about this:

    OK that looks good!

    Though we now don't have a single shared ancestor class that can give
    you a maxScore() when sorting by score or when sorting by fields. Not
    sure how big a loss that is though...

    More comments:

    * Should we implement a default
    TopDocsCollector.collect/setScorer/setNextReader? Ie,
    TopDocsCollector is supposed to be the "you provide the PQ and we
    do the rest" collector, most closely matching TopDocCollector.
    Or.... maybe we don't, because we don't know if we should compute
    the score for you, you may want to put something other than
    ScoreDoc into the queue, etc. Who knows what your class
    considers "top".

    * What happened to OnlyPositiveScoresFilter? (I don't see it).

    * It's a bit of a trap on calling either topDocs method in
    TopDocsCollector (or its subclasses) that it has popped everything
    from your PQ as a side-effect. This is technically a pre-existing
    issue, but the new bracketed version makes the trap more "trappy".
    For example, I can't call it N times once for each page -- I have
    to re-run the search to get another page's worth of results. Can
    you update javadocs eg something like "NOTE: you cannot call this
    method more than once" .

    Refactoring Lucene collectors (HitCollector and extensions)
    -----------------------------------------------------------

    Key: LUCENE-1575
    URL: https://issues.apache.org/jira/browse/LUCENE-1575
    Project: Lucene - Java
    Issue Type: Improvement
    Components: Search
    Reporter: Shai Erera
    Fix For: 2.9

    Attachments: LUCENE-1575.1.patch, LUCENE-1575.2.patch, LUCENE-1575.3.patch, LUCENE-1575.patch


    This issue is a result of a recent discussion we've had on the mailing list. You can read the thread [here|http://www.nabble.com/Is-TopDocCollector%27s-collect()-implementation-correct--td22557419.html].
    We have agreed to do the following refactoring:
    * Rename MultiReaderHitCollector to Collector, with the purpose that it will be the base class for all Collector implementations.
    * Deprecate HitCollector in favor of the new Collector.
    * Introduce new methods in IndexSearcher that accept Collector, and deprecate those that accept HitCollector.
    ** Create a final class HitCollectorWrapper, and use it in the deprecated methods in IndexSearcher, wrapping the given HitCollector.
    ** HitCollectorWrapper will be marked deprecated, so we can remove it in 3.0, when we remove HitCollector.
    ** It will remove any instanceof checks that currently exist in IndexSearcher code.
    * Create a new (abstract) TopDocsCollector, which will:
    ** Leave collect and setNextReader unimplemented.
    ** Introduce protected members PriorityQueue and totalHits.
    ** Introduce a single protected constructor which accepts a PriorityQueue.
    ** Implement topDocs() and getTotalHits() using the PQ and totalHits members. These can be used as-are by extending classes, as well as be overridden.
    ** Introduce a new topDocs(start, howMany) method which will be used a convenience method when implementing a search application which allows paging through search results. It will also attempt to improve the memory allocation, by allocating a ScoreDoc[] of the requested size only.
    * Change TopScoreDocCollector to extend TopDocsCollector, use the topDocs() and getTotalHits() implementations as they are from TopDocsCollector. The class will also be made final.
    * Change TopFieldCollector to extend TopDocsCollector, and make the class final. Also implement topDocs(start, howMany).
    * Change TopFieldDocCollector (deprecated) to extend TopDocsCollector, instead of TopScoreDocCollector. Implement topDocs(start, howMany)
    * Review other places where HitCollector is used, such as in Scorer, deprecate those places and use Collector instead.
    Additionally, the following proposal was made w.r.t. decoupling score from collect():
    * Change collect to accecpt only a doc Id (unbased).
    * Introduce a setScorer(Scorer) method.
    * If during collect the implementation needs the score, it can call scorer.score().
    If we do this, then we need to review all places in the code where collect(doc, score) is called, and assert whether Scorer can be passed. Also this raises few questions:
    * What if during collect() Scorer is null? (i.e., not set) - is it even possible?
    * I noticed that many (if not all) of the collect() implementations discard the document if its score is not greater than 0. Doesn't it mean that score is needed in collect() always?
    Open issues:
    * The name for Collector
    * TopDocsCollector was mentioned on the thread as TopResultsCollector, but that was when we thought to call Colletor ResultsColletor. Since we decided (so far) on Collector, I think TopDocsCollector makes sense, because of its TopDocs output.
    * Decoupling score from collect().
    I will post a patch a bit later, as this is expected to be a very large patch. I will split it into 2: (1) code patch (2) test cases (moving to use Collector instead of HitCollector, as well as testing the new topDocs(start, howMany) method.
    There might be even a 3rd patch which handles the setScorer thing in Collector (maybe even a different issue?)
    --
    This message is automatically generated by JIRA.
    -
    You can reply to this email to add a comment to the issue online.


    ---------------------------------------------------------------------
    To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
    For additional commands, e-mail: java-dev-help@lucene.apache.org
  • Michael McCandless (JIRA) at Apr 1, 2009 at 12:45 pm
    [ https://issues.apache.org/jira/browse/LUCENE-1575?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12694505#action_12694505 ]

    Michael McCandless commented on LUCENE-1575:
    --------------------------------------------

    bq. I did not understand though from your comment if I should do it differently and post another patch, or is that a hint to how someone can still apply the patch?

    The patch command gets confused because it sees set of diffs against what looks to be a pre-existing Collector.java, but of course I have no Collector.java locally. "svn diff" did this because it knows you had renamed MRHC --> C.

    I think for now it's fine if those of us that need to apply the patch simply first copy MultiReaderHitCollector.java --> Collector.java.

    This is yet another example of how "patch" needs to be better integrated with svn: there needs to be a mirror "svn patch" to "svn diff" that's able to properly carry over all svn changes, perhaps do a 3way merge, etc.
    Refactoring Lucene collectors (HitCollector and extensions)
    -----------------------------------------------------------

    Key: LUCENE-1575
    URL: https://issues.apache.org/jira/browse/LUCENE-1575
    Project: Lucene - Java
    Issue Type: Improvement
    Components: Search
    Reporter: Shai Erera
    Fix For: 2.9

    Attachments: LUCENE-1575.1.patch, LUCENE-1575.2.patch, LUCENE-1575.3.patch, LUCENE-1575.patch


    This issue is a result of a recent discussion we've had on the mailing list. You can read the thread [here|http://www.nabble.com/Is-TopDocCollector%27s-collect()-implementation-correct--td22557419.html].
    We have agreed to do the following refactoring:
    * Rename MultiReaderHitCollector to Collector, with the purpose that it will be the base class for all Collector implementations.
    * Deprecate HitCollector in favor of the new Collector.
    * Introduce new methods in IndexSearcher that accept Collector, and deprecate those that accept HitCollector.
    ** Create a final class HitCollectorWrapper, and use it in the deprecated methods in IndexSearcher, wrapping the given HitCollector.
    ** HitCollectorWrapper will be marked deprecated, so we can remove it in 3.0, when we remove HitCollector.
    ** It will remove any instanceof checks that currently exist in IndexSearcher code.
    * Create a new (abstract) TopDocsCollector, which will:
    ** Leave collect and setNextReader unimplemented.
    ** Introduce protected members PriorityQueue and totalHits.
    ** Introduce a single protected constructor which accepts a PriorityQueue.
    ** Implement topDocs() and getTotalHits() using the PQ and totalHits members. These can be used as-are by extending classes, as well as be overridden.
    ** Introduce a new topDocs(start, howMany) method which will be used a convenience method when implementing a search application which allows paging through search results. It will also attempt to improve the memory allocation, by allocating a ScoreDoc[] of the requested size only.
    * Change TopScoreDocCollector to extend TopDocsCollector, use the topDocs() and getTotalHits() implementations as they are from TopDocsCollector. The class will also be made final.
    * Change TopFieldCollector to extend TopDocsCollector, and make the class final. Also implement topDocs(start, howMany).
    * Change TopFieldDocCollector (deprecated) to extend TopDocsCollector, instead of TopScoreDocCollector. Implement topDocs(start, howMany)
    * Review other places where HitCollector is used, such as in Scorer, deprecate those places and use Collector instead.
    Additionally, the following proposal was made w.r.t. decoupling score from collect():
    * Change collect to accecpt only a doc Id (unbased).
    * Introduce a setScorer(Scorer) method.
    * If during collect the implementation needs the score, it can call scorer.score().
    If we do this, then we need to review all places in the code where collect(doc, score) is called, and assert whether Scorer can be passed. Also this raises few questions:
    * What if during collect() Scorer is null? (i.e., not set) - is it even possible?
    * I noticed that many (if not all) of the collect() implementations discard the document if its score is not greater than 0. Doesn't it mean that score is needed in collect() always?
    Open issues:
    * The name for Collector
    * TopDocsCollector was mentioned on the thread as TopResultsCollector, but that was when we thought to call Colletor ResultsColletor. Since we decided (so far) on Collector, I think TopDocsCollector makes sense, because of its TopDocs output.
    * Decoupling score from collect().
    I will post a patch a bit later, as this is expected to be a very large patch. I will split it into 2: (1) code patch (2) test cases (moving to use Collector instead of HitCollector, as well as testing the new topDocs(start, howMany) method.
    There might be even a 3rd patch which handles the setScorer thing in Collector (maybe even a different issue?)
    --
    This message is automatically generated by JIRA.
    -
    You can reply to this email to add a comment to the issue online.


    ---------------------------------------------------------------------
    To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
    For additional commands, e-mail: java-dev-help@lucene.apache.org
  • Steven A Rowe at Apr 1, 2009 at 5:08 pm

    On 4/1/2009 at 8:45 AM, Michael McCandless wrote:
    The patch command gets confused because it sees set of diffs against
    what looks to be a pre-existing Collector.java, but of course I have no
    Collector.java locally. "svn diff" did this because it knows you had
    renamed MRHC --> C. [...]
    This is yet another example of how "patch" needs to be better
    integrated with svn: there needs to be a mirror "svn patch" to "svn
    diff" that's able to properly carry over all svn changes, perhaps do a
    3way merge, etc.
    The Subversion svnpatch-diff branch (a GSoC2007 project[1]), which adds an "svn patch" command to Subversion[2], finally got merged to trunk a month ago[3], but not in time to be included in the branch for the 1.6.X releases.

    Steve

    [1] "An augmented diff representation": <http://code.google.com/p/google-summer-of-code-2007-svn/downloads/list>; discussion on subversion-dev: <http://markmail.org/message/aqlwiehcwdhcgskq>
    [2] "svn patch": <https://svn.collab.net/viewvc/svn/trunk/notes/svnpatch>
    [3] Notice of svnpatch-diff branch merge to trunk: <http://www.nabble.com/-RFC--Merge-svnpatch-diff-branch-to-trunk-td22309583.html>
  • Michael McCandless at Apr 1, 2009 at 5:20 pm

    On Wed, Apr 1, 2009 at 1:07 PM, Steven A Rowe wrote:

    The Subversion svnpatch-diff branch (a GSoC2007 project[1]), which adds an "svn patch" command to Subversion[2], finally got merged to trunk a month ago[3], but not in time to be included in the branch for the 1.6.X releases.
    Hot damn! Maybe it's time to start using svn's trunk ;)

    Mike

    ---------------------------------------------------------------------
    To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
    For additional commands, e-mail: java-dev-help@lucene.apache.org
  • Shai Erera (JIRA) at Apr 1, 2009 at 1:21 pm
    [ https://issues.apache.org/jira/browse/LUCENE-1575?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12694519#action_12694519 ]

    Shai Erera commented on LUCENE-1575:
    ------------------------------------

    bq. Though we now don't have a single shared ancestor class that can give you a maxScore() when sorting by score or when sorting by fields

    I don't think it's a big issue. We have just two extensions to TopDocsCollector, each tracking maxScore differently. Perhaps later on if more extensions are created, or a demand comes from users, we add a ScoringTopDocsCollector class?

    bq. Should we implement a default TopDocsCollector.collect/setScorer/setNextReader?

    I actually think that TopDocsCollector gives you exactly what I had in mind. When I started it, there was only collect(doc, score) method, and I wanted to have a base class that will implement getTotalHits and topDocs for you, while you only need to provide an implementation to collect(). Now collect has really become setNextReader, setScorer and collect. Meaning, you know what you want to do, TDC just takes care of creating the TopDocs for you, and you can even override topDocs(start, howMany) if you want to return a different TopDocs instance.

    I think now it's clean and simple.

    bq. What happened to OnlyPositiveScoresFilter? (I don't see it).

    Well ... you don't see it because I forgot to implement it :).
    Funny thing - all the tests pass without it, meaning all that time, we were filtering <= 0 scored documents, but never really tested it ... I guess if it's because we never really believed it'll happen?

    Anyway, I'll add such a class and a suitable test class (make sure it fails w/o using that wrapper first).

    bq. Can you update javadocs eg something like "NOTE: you cannot call this method more than once" .

    That has always been the case. Previously if you called topDocs() everything has been popped out of the queue for you. I understand what you say though, since we have the topDocs(start, howMany) API, nothing prevents a user from calling topDocs(0, 10) and topDocs(10, 10), only the second call will fail. However, I don't really think that's how people will use it ... and if they do, then perhaps they should just call topDocs() and do whatever they need on these ranges using the TopDocs object?
    I'll add that to the documentation as well.
    Refactoring Lucene collectors (HitCollector and extensions)
    -----------------------------------------------------------

    Key: LUCENE-1575
    URL: https://issues.apache.org/jira/browse/LUCENE-1575
    Project: Lucene - Java
    Issue Type: Improvement
    Components: Search
    Reporter: Shai Erera
    Fix For: 2.9

    Attachments: LUCENE-1575.1.patch, LUCENE-1575.2.patch, LUCENE-1575.3.patch, LUCENE-1575.patch


    This issue is a result of a recent discussion we've had on the mailing list. You can read the thread [here|http://www.nabble.com/Is-TopDocCollector%27s-collect()-implementation-correct--td22557419.html].
    We have agreed to do the following refactoring:
    * Rename MultiReaderHitCollector to Collector, with the purpose that it will be the base class for all Collector implementations.
    * Deprecate HitCollector in favor of the new Collector.
    * Introduce new methods in IndexSearcher that accept Collector, and deprecate those that accept HitCollector.
    ** Create a final class HitCollectorWrapper, and use it in the deprecated methods in IndexSearcher, wrapping the given HitCollector.
    ** HitCollectorWrapper will be marked deprecated, so we can remove it in 3.0, when we remove HitCollector.
    ** It will remove any instanceof checks that currently exist in IndexSearcher code.
    * Create a new (abstract) TopDocsCollector, which will:
    ** Leave collect and setNextReader unimplemented.
    ** Introduce protected members PriorityQueue and totalHits.
    ** Introduce a single protected constructor which accepts a PriorityQueue.
    ** Implement topDocs() and getTotalHits() using the PQ and totalHits members. These can be used as-are by extending classes, as well as be overridden.
    ** Introduce a new topDocs(start, howMany) method which will be used a convenience method when implementing a search application which allows paging through search results. It will also attempt to improve the memory allocation, by allocating a ScoreDoc[] of the requested size only.
    * Change TopScoreDocCollector to extend TopDocsCollector, use the topDocs() and getTotalHits() implementations as they are from TopDocsCollector. The class will also be made final.
    * Change TopFieldCollector to extend TopDocsCollector, and make the class final. Also implement topDocs(start, howMany).
    * Change TopFieldDocCollector (deprecated) to extend TopDocsCollector, instead of TopScoreDocCollector. Implement topDocs(start, howMany)
    * Review other places where HitCollector is used, such as in Scorer, deprecate those places and use Collector instead.
    Additionally, the following proposal was made w.r.t. decoupling score from collect():
    * Change collect to accecpt only a doc Id (unbased).
    * Introduce a setScorer(Scorer) method.
    * If during collect the implementation needs the score, it can call scorer.score().
    If we do this, then we need to review all places in the code where collect(doc, score) is called, and assert whether Scorer can be passed. Also this raises few questions:
    * What if during collect() Scorer is null? (i.e., not set) - is it even possible?
    * I noticed that many (if not all) of the collect() implementations discard the document if its score is not greater than 0. Doesn't it mean that score is needed in collect() always?
    Open issues:
    * The name for Collector
    * TopDocsCollector was mentioned on the thread as TopResultsCollector, but that was when we thought to call Colletor ResultsColletor. Since we decided (so far) on Collector, I think TopDocsCollector makes sense, because of its TopDocs output.
    * Decoupling score from collect().
    I will post a patch a bit later, as this is expected to be a very large patch. I will split it into 2: (1) code patch (2) test cases (moving to use Collector instead of HitCollector, as well as testing the new topDocs(start, howMany) method.
    There might be even a 3rd patch which handles the setScorer thing in Collector (maybe even a different issue?)
    --
    This message is automatically generated by JIRA.
    -
    You can reply to this email to add a comment to the issue online.


    ---------------------------------------------------------------------
    To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
    For additional commands, e-mail: java-dev-help@lucene.apache.org
  • Shai Erera (JIRA) at Apr 1, 2009 at 1:53 pm
    [ https://issues.apache.org/jira/browse/LUCENE-1575?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

    Shai Erera updated LUCENE-1575:
    -------------------------------

    Attachment: LUCENE-1575.4.patch

    Adds:
    * PositiveScoresOnlyCollector and TestPositiveScoresOnlyCollector.
    * Relevant comments in CHANGES
    * Comments to TopDocsCollector.topDocs(start, howMany) and topDocs(start)
    Refactoring Lucene collectors (HitCollector and extensions)
    -----------------------------------------------------------

    Key: LUCENE-1575
    URL: https://issues.apache.org/jira/browse/LUCENE-1575
    Project: Lucene - Java
    Issue Type: Improvement
    Components: Search
    Reporter: Shai Erera
    Fix For: 2.9

    Attachments: LUCENE-1575.1.patch, LUCENE-1575.2.patch, LUCENE-1575.3.patch, LUCENE-1575.4.patch, LUCENE-1575.patch


    This issue is a result of a recent discussion we've had on the mailing list. You can read the thread [here|http://www.nabble.com/Is-TopDocCollector%27s-collect()-implementation-correct--td22557419.html].
    We have agreed to do the following refactoring:
    * Rename MultiReaderHitCollector to Collector, with the purpose that it will be the base class for all Collector implementations.
    * Deprecate HitCollector in favor of the new Collector.
    * Introduce new methods in IndexSearcher that accept Collector, and deprecate those that accept HitCollector.
    ** Create a final class HitCollectorWrapper, and use it in the deprecated methods in IndexSearcher, wrapping the given HitCollector.
    ** HitCollectorWrapper will be marked deprecated, so we can remove it in 3.0, when we remove HitCollector.
    ** It will remove any instanceof checks that currently exist in IndexSearcher code.
    * Create a new (abstract) TopDocsCollector, which will:
    ** Leave collect and setNextReader unimplemented.
    ** Introduce protected members PriorityQueue and totalHits.
    ** Introduce a single protected constructor which accepts a PriorityQueue.
    ** Implement topDocs() and getTotalHits() using the PQ and totalHits members. These can be used as-are by extending classes, as well as be overridden.
    ** Introduce a new topDocs(start, howMany) method which will be used a convenience method when implementing a search application which allows paging through search results. It will also attempt to improve the memory allocation, by allocating a ScoreDoc[] of the requested size only.
    * Change TopScoreDocCollector to extend TopDocsCollector, use the topDocs() and getTotalHits() implementations as they are from TopDocsCollector. The class will also be made final.
    * Change TopFieldCollector to extend TopDocsCollector, and make the class final. Also implement topDocs(start, howMany).
    * Change TopFieldDocCollector (deprecated) to extend TopDocsCollector, instead of TopScoreDocCollector. Implement topDocs(start, howMany)
    * Review other places where HitCollector is used, such as in Scorer, deprecate those places and use Collector instead.
    Additionally, the following proposal was made w.r.t. decoupling score from collect():
    * Change collect to accecpt only a doc Id (unbased).
    * Introduce a setScorer(Scorer) method.
    * If during collect the implementation needs the score, it can call scorer.score().
    If we do this, then we need to review all places in the code where collect(doc, score) is called, and assert whether Scorer can be passed. Also this raises few questions:
    * What if during collect() Scorer is null? (i.e., not set) - is it even possible?
    * I noticed that many (if not all) of the collect() implementations discard the document if its score is not greater than 0. Doesn't it mean that score is needed in collect() always?
    Open issues:
    * The name for Collector
    * TopDocsCollector was mentioned on the thread as TopResultsCollector, but that was when we thought to call Colletor ResultsColletor. Since we decided (so far) on Collector, I think TopDocsCollector makes sense, because of its TopDocs output.
    * Decoupling score from collect().
    I will post a patch a bit later, as this is expected to be a very large patch. I will split it into 2: (1) code patch (2) test cases (moving to use Collector instead of HitCollector, as well as testing the new topDocs(start, howMany) method.
    There might be even a 3rd patch which handles the setScorer thing in Collector (maybe even a different issue?)
    --
    This message is automatically generated by JIRA.
    -
    You can reply to this email to add a comment to the issue online.


    ---------------------------------------------------------------------
    To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
    For additional commands, e-mail: java-dev-help@lucene.apache.org

Related Discussions

Discussion Navigation
viewthread | post
Discussion Overview
groupdev @
categorieslucene
postedMar 27, '09 at 4:45p
activeApr 24, '09 at 8:05p
posts137
users4
websitelucene.apache.org

People

Translate

site design / logo © 2021 Grokbase