FieldCache should include a BitSet for matching docs
----------------------------------------------------

Key: LUCENE-2649
URL: https://issues.apache.org/jira/browse/LUCENE-2649
Project: Lucene - Java
Issue Type: Improvement
Reporter: Ryan McKinley
Fix For: 4.0


The FieldCache returns an array representing the values for each doc. However there is no way to know if the doc actually has a value.

This should be changed to return an object representing the values *and* a BitSet for all valid docs.
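
To illustrate the problem, here is a minimal self-contained sketch in plain Java. It does not use Lucene's API; the class name, the sample data, and the use of java.util.BitSet are assumptions made for illustration only. A primitive array cannot distinguish a stored 0 from a missing value, whereas a companion bit set can:

```java
import java.util.BitSet;

// Hypothetical stand-in, not Lucene code: shows why a bare primitive array is ambiguous.
class MissingValueDemo {
    // Values for 4 docs; doc 1 really stores 0, doc 2 has no value at all.
    public static final int[] VALUES = {7, 0, 0, 3};
    public static final BitSet VALID = new BitSet(4);

    static {
        VALID.set(0);
        VALID.set(1);
        VALID.set(3);
    }

    /** Returns true if the doc has a real value for the field. */
    public static boolean hasValue(int doc) {
        return VALID.get(doc);
    }

    public static void main(String[] args) {
        for (int doc = 0; doc < VALUES.length; doc++) {
            System.out.println("doc " + doc + ": value=" + VALUES[doc]
                + " hasValue=" + hasValue(doc));
        }
    }
}
```

Docs 1 and 2 both read as 0 from the array alone; only the bit set tells them apart.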

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org


  • Ryan McKinley (JIRA) at Sep 17, 2010 at 4:51 am
    [ https://issues.apache.org/jira/browse/LUCENE-2649?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12910446#action_12910446 ]

    Ryan McKinley commented on LUCENE-2649:
    ---------------------------------------

    See some discussion here:
    http://search.lucidimagination.com/search/document/b6a531f7b73621f1/trie_fields_and_sortmissinglast
  • Ryan McKinley (JIRA) at Sep 17, 2010 at 4:56 am
    [ https://issues.apache.org/jira/browse/LUCENE-2649?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

    Ryan McKinley updated LUCENE-2649:
    ----------------------------------

    Attachment: LUCENE-2649-FieldCacheWithBitSet.patch

    This patch replaces the cached primitive[] with a CachedObject. The object hierarchy looks like this:

    {code:java}
    public abstract static class CachedObject {
    }

    public abstract static class CachedArray extends CachedObject {
      public final Bits valid;

      public CachedArray(Bits valid) {
        this.valid = valid;
      }
    }

    public static final class ByteValues extends CachedArray {
      public final byte[] values;

      public ByteValues(byte[] values, Bits valid) {
        super(valid);
        this.values = values;
      }
    }
    ...
    {code}

    The patch then deprecates the getBytes() methods and replaces them with getByteValues():

    {code:java}
    public ByteValues getByteValues(IndexReader reader, String field)
      throws IOException;

    public ByteValues getByteValues(IndexReader reader, String field, ByteParser parser)
      throws IOException;
    {code}

    Then repeat for all the other types!

    All tests pass with this patch, but I have not added any tests for the BitSet (yet).

    If people like the general look of this approach, I will clean it up and add some tests, javadoc cleanup, etc.
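
A rough sketch of how a caller might consume such an object follows. The nested classes are simplified stand-ins modelled on the excerpt above (the CachedObject base is omitted), the Bits interface is reduced to a single assumed get(int) method, and the helper methods sumValid and countMissing are hypothetical names that are not part of the patch:

```java
// Simplified stand-ins for the classes in the patch excerpt; not Lucene code.
class ByteValuesDemo {
    /** Assumed minimal shape of a read-only bit set. */
    public interface Bits {
        boolean get(int index);
    }

    public abstract static class CachedArray {
        public final Bits valid;

        public CachedArray(Bits valid) {
            this.valid = valid;
        }
    }

    public static final class ByteValues extends CachedArray {
        public final byte[] values;

        public ByteValues(byte[] values, Bits valid) {
            super(valid);
            this.values = values;
        }
    }

    /** Sums only the docs that actually have a value (hypothetical helper). */
    public static int sumValid(ByteValues cached) {
        int sum = 0;
        for (int doc = 0; doc < cached.values.length; doc++) {
            if (cached.valid.get(doc)) {
                sum += cached.values[doc];
            }
        }
        return sum;
    }

    /** Counts docs with no value for the field (hypothetical helper). */
    public static int countMissing(ByteValues cached) {
        int missing = 0;
        for (int doc = 0; doc < cached.values.length; doc++) {
            if (!cached.valid.get(doc)) {
                missing++;
            }
        }
        return missing;
    }

    public static void main(String[] args) {
        byte[] values = {5, 0, 0, 2};
        boolean[] has = {true, true, false, true};
        ByteValues cached = new ByteValues(values, i -> has[i]);
        System.out.println("sum=" + sumValid(cached) + " missing=" + countMissing(cached));
    }
}
```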

  • Ryan McKinley (JIRA) at Sep 17, 2010 at 5:22 am
    [ https://issues.apache.org/jira/browse/LUCENE-2649?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

    Ryan McKinley updated LUCENE-2649:
    ----------------------------------

    Attachment: LUCENE-2649-FieldCacheWithBitSet.patch

    A slightly simplified version
  • Uwe Schindler (JIRA) at Sep 17, 2010 at 5:25 am
    [ https://issues.apache.org/jira/browse/LUCENE-2649?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12910461#action_12910461 ]

    Uwe Schindler commented on LUCENE-2649:
    ---------------------------------------

    That looks exactly like I proposed it!

    The only thing: For DocTerms the approach is not needed? You can check for null, so the Bits interface is not needed. As the OpenBitSets are created with the exact size and don't need to grow, you can use fastSet to speed up creation by doing no bounds checks.
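
The fastSet point can be sketched with a small fixed-size bit set. This is a hypothetical class (FixedBits) written for illustration; it only mirrors the idea of an unchecked setter on a pre-sized bit set and is not Lucene's OpenBitSet implementation. The JVM still performs its own array bounds check, so the saving is the explicit range check on every call:

```java
// Hypothetical fixed-size bit set, written only to illustrate the idea.
class FixedBits {
    private final long[] words;
    private final int size;

    public FixedBits(int size) {
        this.size = size;
        // One long holds 64 bits; round up to cover every index below size.
        this.words = new long[(size + 63) >>> 6];
    }

    /** Checked variant: validates the index on every call. */
    public void set(int index) {
        if (index < 0 || index >= size) {
            throw new IndexOutOfBoundsException("index " + index);
        }
        fastSet(index);
    }

    /** Unchecked variant: the caller guarantees 0 <= index < size. */
    public void fastSet(int index) {
        words[index >>> 6] |= 1L << (index & 63);
    }

    public boolean get(int index) {
        return (words[index >>> 6] & (1L << (index & 63))) != 0;
    }

    public static void main(String[] args) {
        int maxDoc = 100;
        FixedBits valid = new FixedBits(maxDoc);
        // The loop bound already guarantees a legal index, so no extra check is needed.
        for (int doc = 0; doc < maxDoc; doc += 2) {
            valid.fastSet(doc);
        }
        System.out.println(valid.get(4) + " " + valid.get(5));
    }
}
```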
  • Uwe Schindler (JIRA) at Sep 17, 2010 at 5:32 am
    [ https://issues.apache.org/jira/browse/LUCENE-2649?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12910464#action_12910464 ]

    Uwe Schindler commented on LUCENE-2649:
    ---------------------------------------

    When this is committed, we can also improve some parts of Lucene: FieldCacheRangeFilter does not need to do extra deletion checks and can instead use the Bits interface to find missing/non-valued documents. Lucene's sorting Collectors can be improved to have consistent behaviour for missing values (like Solr's sortMissingFirst/Last).
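
As a rough illustration of consistent missing-value ordering, the sketch below sorts doc ids by value and places value-less docs first or last depending on a flag. It is plain Java, not a Lucene Collector or comparator; the class name, method name, and the use of java.util.BitSet as the valid-docs set are assumptions:

```java
import java.util.Arrays;
import java.util.BitSet;
import java.util.Comparator;

// Hypothetical sketch of missing-first / missing-last ordering; not Lucene code.
class MissingAwareSort {
    /** Returns doc ids ordered by value, with value-less docs placed first or last. */
    public static Integer[] sortDocs(int[] values, BitSet valid, boolean missingLast) {
        Integer[] docs = new Integer[values.length];
        for (int i = 0; i < docs.length; i++) {
            docs[i] = i;
        }
        Comparator<Integer> cmp = (a, b) -> {
            boolean hasA = valid.get(a);
            boolean hasB = valid.get(b);
            if (hasA != hasB) {
                // Exactly one doc is missing: the flag decides which side wins.
                boolean aFirst = (hasA == missingLast);
                return aFirst ? -1 : 1;
            }
            if (!hasA) {
                return 0; // both missing: keep their relative order
            }
            return Integer.compare(values[a], values[b]);
        };
        Arrays.sort(docs, cmp);
        return docs;
    }

    public static void main(String[] args) {
        int[] values = {5, 0, -3, 9}; // doc 1 has no value; its 0 is just the array default
        BitSet valid = new BitSet(4);
        valid.set(0);
        valid.set(2);
        valid.set(3);
        System.out.println(Arrays.toString(sortDocs(values, valid, true)));
        System.out.println(Arrays.toString(sortDocs(values, valid, false)));
    }
}
```

Without the valid bits, doc 1 would sort between the docs holding -3 and 5 purely because of the array default.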
  • Michael McCandless (JIRA) at Sep 17, 2010 at 9:22 am
    [ https://issues.apache.org/jira/browse/LUCENE-2649?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12910513#action_12910513 ]

    Michael McCandless commented on LUCENE-2649:
    --------------------------------------------

    Looks great!

    Should we make it optional, whether the valid bitset should be computed? Many apps wouldn't need it, so it just ties up (admittedly smallish amounts of) RAM unnecessarily?

    bq. Lucene's sorting Collectors can be improved to have a consistent behaviour for missing values (like Solr's sortMissingFirst/Last).

    +1

    Shouldn't we pull Solr's sortMissingFirst/Last down into Lucene?
  • Simon Willnauer (JIRA) at Sep 17, 2010 at 11:11 am
    [ https://issues.apache.org/jira/browse/LUCENE-2649?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12910539#action_12910539 ]

    Simon Willnauer commented on LUCENE-2649:
    -----------------------------------------

    bq. Should we make it optional, whether the valid bitset should be computed? Many apps wouldn't need it, so it just ties up (admittedly smallish amounts of) RAM unnecessarily?
    +1 we can save that overhead and high level apps can enable it by default if needed.


  • Yonik Seeley (JIRA) at Sep 17, 2010 at 12:48 pm
    [ https://issues.apache.org/jira/browse/LUCENE-2649?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12910554#action_12910554 ]

    Yonik Seeley commented on LUCENE-2649:
    --------------------------------------

    bq. Should we make it optional, whether the valid bitset should be computed?

    The trick is how to implement that (unless you mean just set it to true/false for all fields at once). Putting a flag on the FieldCache.getXXX methods is insufficient.
    Only the application knows if some of its future uses of that field will require the bitset for matching docs, but it's Lucene that's often making the calls to the field cache.

    Perhaps FieldCache.Parser was originally just too narrow in scope - it should have been a factory method for handling all decisions about creating and populating a field cache entry?
  • Simon Willnauer (JIRA) at Sep 17, 2010 at 1:27 pm
    [ https://issues.apache.org/jira/browse/LUCENE-2649?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12910568#action_12910568 ]

    Simon Willnauer commented on LUCENE-2649:
    -----------------------------------------

    bq. Perhaps FieldCache.Parser was originally just too narrow in scope - it should have been a factory method for handling all decisions about creating and populating a field cache entry?
    I guess we need to be able to manually configure FieldCache with some kind of FieldType. There have been several issues mentioning this and it keeps coming up again and again. I think it is just time to rethink Fieldable / Field and move towards some kind of flexible type definition for Fields in Lucene. A FieldType could then have a FieldCache Attribute which contains all necessary info including the parser and flags like the one we are talking about. Yet, before I get too excited about FieldType, yeah something with a wider scope than FieldCache.Parser would work in this case. I don't know how far away FieldType is, but it can eventually replace whatever is going to be implemented here with regard to that flag.

    I think by default we should not enable the Bits feature; it must be explicitly enabled via whatever mechanism we end up using.


  • Yonik Seeley (JIRA) at Sep 17, 2010 at 1:41 pm
    [ https://issues.apache.org/jira/browse/LUCENE-2649?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12910576#action_12910576 ]

    Yonik Seeley commented on LUCENE-2649:
    --------------------------------------

    bq. I guess we need to be able to manually configure FieldCache with some kind of FieldType.

    I don't know how well that would work. For one, there's only one FieldCache, so configuring it with anything seems problematic.
    Also, if I have to list out all the fields I'm going to use, that's another big step backwards.

    A factory would be a pretty straightforward way to increase the power, by allowing users to populate the entry through any mechanism, and optionally do extra calculations when the entry is populated (think statistics, sum-of-squares, etc).
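
One way to picture the factory idea is sketched below. Everything here is hypothetical: EntryFactory, IntEntry, SimpleCache, and the String[] input standing in for a field's per-doc terms are invented for illustration and do not correspond to any existing or proposed Lucene API. The point is only that the factory controls how the entry is populated and can compute extras (here the valid flags and a sum of squares) in the same pass:

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of a factory that owns cache-entry creation; not Lucene code.
class CacheFactoryDemo {
    /** Hypothetical hook: builds the complete cache entry for one field. */
    public interface EntryFactory<T> {
        // rawTerms[doc] is null when the doc has no value for the field.
        T create(String field, String[] rawTerms);
    }

    /** Entry holding values, valid flags, and a statistic computed while populating. */
    public static final class IntEntry {
        public final int[] values;
        public final boolean[] valid;
        public final long sumOfSquares;

        IntEntry(int[] values, boolean[] valid, long sumOfSquares) {
            this.values = values;
            this.valid = valid;
            this.sumOfSquares = sumOfSquares;
        }
    }

    /** Example factory: parses ints, tracks missing docs, accumulates a sum of squares. */
    public static final EntryFactory<IntEntry> INT_WITH_STATS = (field, rawTerms) -> {
        int[] values = new int[rawTerms.length];
        boolean[] valid = new boolean[rawTerms.length];
        long sumOfSquares = 0;
        for (int doc = 0; doc < rawTerms.length; doc++) {
            if (rawTerms[doc] == null) {
                continue; // no value for this doc
            }
            values[doc] = Integer.parseInt(rawTerms[doc]);
            valid[doc] = true;
            sumOfSquares += (long) values[doc] * values[doc];
        }
        return new IntEntry(values, valid, sumOfSquares);
    };

    /** Tiny cache keyed by field name; the factory runs only on a cache miss. */
    public static final class SimpleCache<T> {
        private final Map<String, T> entries = new HashMap<>();

        public T get(String field, String[] rawTerms, EntryFactory<T> factory) {
            return entries.computeIfAbsent(field, f -> factory.create(f, rawTerms));
        }
    }

    public static void main(String[] args) {
        SimpleCache<IntEntry> cache = new SimpleCache<>();
        IntEntry entry = cache.get("price", new String[] {"3", null, "4"}, INT_WITH_STATS);
        System.out.println("sumOfSquares=" + entry.sumOfSquares + " doc1Valid=" + entry.valid[1]);
    }
}
```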
  • Simon Willnauer (JIRA) at Sep 17, 2010 at 2:00 pm
    [ https://issues.apache.org/jira/browse/LUCENE-2649?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12910589#action_12910589 ]

    Simon Willnauer commented on LUCENE-2649:
    -----------------------------------------

    bq. Also, if I have to list out all the fields I'm going to use, that's another big step backwards.
    I don't think that this is needed at all, nor would it be a step backwards - not even close. But since this issue isn't about FieldType, let's just drop it...

    bq. A factory would be a pretty straightforward way to increase the power, by allowing users to populate the entry through any mechanism, and optionally do extra calculations when the entry is populated (think statistics, sum-of-squares, etc).
    Whatever you call it (using Factory is fine) but isn't that what you mentioned to be insufficient? I mean this is something you would pass to a FieldCache.getXXX, right?
  • Shai Erera (JIRA) at Sep 17, 2010 at 2:02 pm
    [ https://issues.apache.org/jira/browse/LUCENE-2649?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12910591#action_12910591 ]

    Shai Erera commented on LUCENE-2649:
    ------------------------------------

    One thing I've wanted to do for a long time, but never got around to, is open up FieldCache to allow the application to populate the entries from other sources - specifically payloads. I wrote a sorting solution which relies solely on payloads, and wanted to contribute it to Lucene, but due to the lack of FieldCache hook points, I didn't find the time to do the necessary refactoring.

    Sorting based on payloads-data has several advantages:
    # It's much faster to read than iterating on the lexicon and parsing the term values into sortable values.
    # If your application needs to sort over tens of millions of documents, or if it needs to keep its RAM usage low, you can do the sort while reading the payload data as the search happens. It's faster than if it was in RAM, but the current FieldCache does not allow you to sort w/o RAM consumption.
    # You don't inflate your lexicon w/ sort values, affecting other searches. In some situations, you can add a unique term per document for the sort values (such as when sorting by date and require up to a millisecond precision).

    I'm bringing it up so that if you consider any refactoring to FieldCache, I'd appreciate if you can keep that in mind. If the right hooks will open up, I'll make time to contribute my sort-by-payload package. If you don't, then it'll need to wait until I can find the time to do the refactoring.


  • Simon Willnauer (JIRA) at Sep 17, 2010 at 2:02 pm
    [ https://issues.apache.org/jira/browse/LUCENE-2649?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12910568#action_12910568 ]

    Simon Willnauer edited comment on LUCENE-2649 at 9/17/10 9:59 AM:
    ------------------------------------------------------------------

    bq. Perhaps FieldCache.Parser was originally just too narrow in scope - it should have been a factory method for handling all decisions about creating and populating a field cache entry?
    I guess we need to be able to manually configure FieldCache with some kind of FieldType. There have been several issues mentioning this and it keeps coming up again and again. I think it is just time to rethink Fieldable / Field and move towards some kind of flexible type definition for Fields in Lucene. A FieldType could then have a FieldCache Attribute which contains all necessary info including the parser and flags like the one we are talking about. Yet, before I get too excited about FieldType, yeah something with a wider scope than FieldCache.Parser would work in this case. I don't know how far away FieldType is, but it can eventually replace whatever is going to be implemented here with regard to that flag.

    I think by default we should not enable the Bits feature; it must be explicitly enabled via whatever mechanism we end up using.



  • Yonik Seeley (JIRA) at Sep 17, 2010 at 2:11 pm
    [ https://issues.apache.org/jira/browse/LUCENE-2649?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12910598#action_12910598 ]

    Yonik Seeley commented on LUCENE-2649:
    --------------------------------------

    bq. Whatever you call it (using Factory is fine) but isn't that what you mentioned to be insufficient? I mean this is something you would pass to a FieldCache.getXXX, right?

    I was suggesting handling it the same way as FieldCache.Parser - it's set on the SortField. But instead of just being able to control parsing of a term (which is too limited), it needs to be able to control everything. (This would solve Shai's needs too)
  • Yonik Seeley (JIRA) at Sep 17, 2010 at 2:16 pm
    [ https://issues.apache.org/jira/browse/LUCENE-2649?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12910599#action_12910599 ]

    Yonik Seeley commented on LUCENE-2649:
    --------------------------------------

    bq. open up FieldCache to allow the application to populate the entries from other sources

    +1

    bq. specifically payloads

    If CSF did not exist, I'd be totally on board with this... but it looks to be right around the corner now. Are there any advantages to using payloads over CSF for fieldcache population?
  • Ryan McKinley (JIRA) at Sep 17, 2010 at 3:14 pm
    [ https://issues.apache.org/jira/browse/LUCENE-2649?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12910618#action_12910618 ]

    Ryan McKinley commented on LUCENE-2649:
    ---------------------------------------

    This is a band-aid, but we could consider adding something like:
    {code:java}
    public void setCacheValidBitsForFields( Set<String> names );
    {code}
    on FieldCache, then checking if the field is in that set before making the BitSet

    When Solr reads the schema, it could look for any fields that have sortMissingLast and then call:
    {code:java}
    FieldCache.DEFAULT.setCacheValidBitsForFields()
    {code}

    The factory idea also sounds good, but I don't see how it would work without very big changes.
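
A minimal sketch of the band-aid, under stated assumptions: only the setter name setCacheValidBitsForFields comes from the comment above. The enclosing class, the buildValidBits helper, the String[] input standing in for a field's per-doc terms, and the choice to return null for unconfigured fields are all hypothetical:

```java
import java.util.BitSet;
import java.util.Collections;
import java.util.HashSet;
import java.util.Set;

// Hypothetical sketch of per-field opt-in for valid bits; not Lucene code.
class ValidBitsConfigDemo {
    private Set<String> fieldsWithValidBits = Collections.emptySet();

    /** Mirrors the proposed setter; the body is an assumption. */
    public void setCacheValidBitsForFields(Set<String> names) {
        // Defensive copy so later changes by the caller do not affect the cache.
        this.fieldsWithValidBits = new HashSet<>(names);
    }

    /** Builds the valid bits only for configured fields; returns null otherwise. */
    public BitSet buildValidBits(String field, String[] rawTerms) {
        if (!fieldsWithValidBits.contains(field)) {
            return null; // skip the extra RAM for fields that did not ask for it
        }
        BitSet valid = new BitSet(rawTerms.length);
        for (int doc = 0; doc < rawTerms.length; doc++) {
            if (rawTerms[doc] != null) {
                valid.set(doc);
            }
        }
        return valid;
    }

    public static void main(String[] args) {
        ValidBitsConfigDemo cache = new ValidBitsConfigDemo();
        cache.setCacheValidBitsForFields(Collections.singleton("price"));
        String[] raw = {"a", null, "c"};
        System.out.println("price: " + cache.buildValidBits("price", raw));
        System.out.println("title: " + cache.buildValidBits("title", raw));
    }
}
```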
  • Shai Erera (JIRA) at Sep 17, 2010 at 3:21 pm
    [ https://issues.apache.org/jira/browse/LUCENE-2649?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12910625#action_12910625 ]

    Shai Erera commented on LUCENE-2649:
    ------------------------------------

    bq. Are there any advantages to using payloads over CSF for fieldcache population?

    Well .. payloads already exist (in my code :)), while CSF is "just around the corner" for a long time. While the two ultimately achieve the same goal, CSF is more generic than just payloads, and if we'd want to take advantage of it w/ FieldCache, I assume we'll need to make more changes to FieldCache, because w/ CSF, people can store arbitrary byte[] and request to cache them. So sorting data is a subset of CSF indeed, but I think the road to CSF + CSF-FieldCache integration is long. But perhaps I'm not up-to-date and there is progress / someone actually working on CSF?

    Anyway, opening up FC to read from payloads seems to me a much easier solution, because besides reading the stuff from the payload, the rest of the classes continue to work the same (TopFieldCollector, Comparators etc.).

    Maybe a slight change to SortField will be required as well though, not sure yet.
  • Ryan McKinley (JIRA) at Sep 17, 2010 at 3:25 pm
    [ https://issues.apache.org/jira/browse/LUCENE-2649?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12910628#action_12910628 ]

    Ryan McKinley commented on LUCENE-2649:
    ---------------------------------------

    Uwe: "For DocTerms the approach is not needed..."

    Ya, I realized this after looking at the patch I first submitted. In the first patch, the cache holds a CachedObject rather than just an Object. In the second, I changed back to just an Object so it does not need to wrap the DocTerms or DocTermsIndex.

    For the RangeFilter, with optional Bits calculation, that code would look something like:
    {code:java}
    LongValues cached = FieldCache.DEFAULT.getLongValues(reader, field, (FieldCache.LongParser) parser);
    final long[] values = cached.values;
    if (cached.valid == null) {
      // ignore deleted docs if range doesn't contain 0
      return new FieldCacheDocIdSet(reader, !(inclusiveLowerPoint <= 0L && inclusiveUpperPoint >= 0L)) {
        @Override
        boolean matchDoc(int doc) {
          return values[doc] >= inclusiveLowerPoint && values[doc] <= inclusiveUpperPoint;
        }
      };
    } else {
      // valid bits are available, so also require that the doc actually has a value
      final Bits valid = cached.valid;
      return new FieldCacheDocIdSet(reader, true) {
        @Override
        boolean matchDoc(int doc) {
          return valid.get(doc) && values[doc] >= inclusiveLowerPoint && values[doc] <= inclusiveUpperPoint;
        }
      };
    }
    {code}
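
    To make the ambiguity concrete, here is a self-contained sketch of the same matching logic. `LongValuesSketch` is a hypothetical holder and `java.util.BitSet` stands in for Lucene's `Bits`; neither is the API from the patch.

```java
import java.util.BitSet;

/** Hypothetical holder for cached values plus optional valid bits; not Lucene's API. */
final class LongValuesSketch {
    final long[] values;
    final BitSet valid; // null when valid bits were not computed

    LongValuesSketch(long[] values, BitSet valid) {
        this.values = values;
        this.valid = valid;
    }

    /** True if the doc has a value inside [lower, upper]. */
    boolean matchDoc(int doc, long lower, long upper) {
        boolean inRange = values[doc] >= lower && values[doc] <= upper;
        // Without valid bits, a doc with no value is indistinguishable
        // from a doc that really holds 0.
        return valid == null ? inRange : valid.get(doc) && inRange;
    }
}
```

    With values `{5, 0, 0}` where only docs 0 and 1 have a value, a range of [-1, 1] matches doc 2 when `valid` is null and rejects it when the bits are present.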
  • Yonik Seeley (JIRA) at Sep 17, 2010 at 3:43 pm
    [ https://issues.apache.org/jira/browse/LUCENE-2649?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12910635#action_12910635 ]

    Yonik Seeley commented on LUCENE-2649:
    --------------------------------------

    bq. public void setCacheValidBitsForFields( Set<String> names );

    Solr doesn't even know all of the fields at the time it reads its schema. And even if it did... this would seem to break multi-core or anything trying to have more than one index where the fields are different. Seems like this needs to be passed down via SortField, just like FieldCache.Parser. A factory makes this a more generic method than adding additional params to SortField every time we think of something like this... then we can add stuff like getFieldCacheParser() and other stuff to the factory.
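
    A rough sketch of what "passed down via SortField" could look like, so that two cores can ask for different behavior without any global state. `EntryOptionsSketch` and `SortFieldSketch` are hypothetical names invented for illustration; they are not Lucene classes.

```java
/** Hypothetical per-sort options; illustrative names, not Lucene API. */
interface EntryOptionsSketch {
    boolean recordMissing();
}

/** Hypothetical stand-in for SortField carrying its own options. */
final class SortFieldSketch {
    final String field;
    final EntryOptionsSketch options;

    SortFieldSketch(String field, EntryOptionsSketch options) {
        this.field = field;
        this.options = options;
    }

    /** The cache key includes the option, so differently configured entries do not collide. */
    String cacheKey() {
        return field + "|recordMissing=" + options.recordMissing();
    }
}
```

    Further hooks (a parser getter, for example) could be added to the options interface later without touching the sort field's constructor again.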

  • Ryan McKinley (JIRA) at Sep 17, 2010 at 3:55 pm
    [ https://issues.apache.org/jira/browse/LUCENE-2649?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12910640#action_12910640 ]

    Ryan McKinley commented on LUCENE-2649:
    ---------------------------------------

    Oh right, that's true. Is a global flag sufficient?

    In Lucene it could default to false and in Solr default to true.

    I know we don't want to just keep adding more things to memory, but I'm not sure there is a huge win by selectively enabling and disabling some fields.
  • Yonik Seeley (JIRA) at Sep 17, 2010 at 4:02 pm
    [ https://issues.apache.org/jira/browse/LUCENE-2649?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12910645#action_12910645 ]

    Yonik Seeley commented on LUCENE-2649:
    --------------------------------------

    bq. oh right - thats true. Is a global flag sufficient?

    Yeah, solr could just always default it to on. We don't know what kind of ad-hoc queries people will throw at solr and the 3% size increase (general case 1/32) seems completely worth it.
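
    The 1/32 figure follows from storing one extra bit per document next to each cached value. A small sketch of the arithmetic, using a hypothetical `ValidBitsOverhead` helper:

```java
/** Hypothetical helper that only illustrates the arithmetic. */
final class ValidBitsOverhead {
    /** Extra memory of one valid bit per doc, as a percentage of the value array. */
    static double overheadPercent(int bitsPerValue) {
        return 100.0 / bitsPerValue;
    }

    public static void main(String[] args) {
        System.out.println("int:  " + overheadPercent(Integer.SIZE) + "%"); // 1/32
        System.out.println("byte: " + overheadPercent(Byte.SIZE) + "%");    // 1/8
    }
}
```

    So the narrower the cached type, the larger the relative cost of the bit set: 3.125% for ints, 12.5% for bytes.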
  • Ryan McKinley (JIRA) at Sep 17, 2010 at 4:33 pm
    [ https://issues.apache.org/jira/browse/LUCENE-2649?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

    Ryan McKinley updated LUCENE-2649:
    ----------------------------------

    Attachment: LUCENE-2649-FieldCacheWithBitSet.patch

    I added a static flag to CachedArray:
    {code:java}
    public abstract static class CachedArray {
      public static boolean CACHE_VALID_ARRAY_BITS = false;

      public final Bits valid;

      public CachedArray(Bits valid) {
        this.valid = valid;
      }
    }
    {code}
    and then set it to true in the SolrCore static initializer.

    If folks are OK with this approach, I'll clean up the javadocs etc.
  • Ryan McKinley (JIRA) at Sep 17, 2010 at 4:34 pm
    [ https://issues.apache.org/jira/browse/LUCENE-2649?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12910653#action_12910653 ]

    Ryan McKinley commented on LUCENE-2649:
    ---------------------------------------

    FYI, I like the idea of revisiting the FieldCache, but I don't see a straightforward path.
  • Uwe Schindler (JIRA) at Sep 17, 2010 at 4:44 pm
    [ https://issues.apache.org/jira/browse/LUCENE-2649?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12910654#action_12910654 ]

    Uwe Schindler commented on LUCENE-2649:
    ---------------------------------------

    I am against the configuration option to enable the additional BitSet. The problem is that you cannot control it for each usage of the FieldCache, as it is a static flag. We agreed in the past that we will remove all static defaults from Lucene (e.g. BQ.maxClauseCount) together with system properties. This flag can cause strange problems with 3rd party code (like when you lower the BQ maxClauseCount and suddenly your queries fail).

    The overhead of the OpenBitSet is very marginal (for integers only 1/32, as Yonik said). If you have memory problems with the FieldCache, this 1/32 would not hurt you, as you should think about your whole configuration then (like moving from ints to shorts or something like that).

    So: Please don't add any static defaults or sysprops! Please, please, please!
  • Mark Miller (JIRA) at Sep 17, 2010 at 4:57 pm
    [ https://issues.apache.org/jira/browse/LUCENE-2649?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12910660#action_12910660 ]

    Mark Miller commented on LUCENE-2649:
    -------------------------------------

    bq. I was suggesting handling it the same way as FieldCache.Parser - it's set on the SortField. But instead of just being able to control parsing of a term (which is too limited), it needs to be able to control everything. (This would solve Shai's needs too)

    We started down this path with LUCENE-831 - you could pass some *UnInverter on the sort field, if I remember right, so that pretty much everything could be overridden. It has come up a lot - we really need this level of customizability eventually.
  • Ryan McKinley (JIRA) at Sep 17, 2010 at 5:01 pm
    [ https://issues.apache.org/jira/browse/LUCENE-2649?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12910663#action_12910663 ]

    Ryan McKinley commented on LUCENE-2649:
    ---------------------------------------

    I'm all for dropping the static flag and always calculating the valid bits -- it makes things accurate with minimal cost.

    I am sympathetic to folks who don't want this, and I'm not sure the cleanest way to support both options, or even if it is actually worthwhile.

    Do people see this 'option' as a showstopper? If so, is there an easy way to configure it? Without statics, the flag would need to be fetched from each parser, and the parser does not know which FieldCache it is used from (using FieldCache.DEFAULT is just as bad as the static flag, IIUC).


  • Marvin Humphrey (JIRA) at Sep 17, 2010 at 5:14 pm
    [ https://issues.apache.org/jira/browse/LUCENE-2649?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12910668#action_12910668 ]

    Marvin Humphrey commented on LUCENE-2649:
    -----------------------------------------

    bq. So: Please don't add any static defaults or sysprops! Please, please, please!

    +1

    No global variables which control behavior, please.
  • Michael McCandless (JIRA) at Sep 17, 2010 at 5:30 pm
    [ https://issues.apache.org/jira/browse/LUCENE-2649?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12910677#action_12910677 ]

    Michael McCandless commented on LUCENE-2649:
    --------------------------------------------

    I know it's only 3% (for ints... 12.5% for bytes), but, 3% here, 3% there and suddenly we're talking real money.

    Lucene can only stay lean and mean if we don't allow these little 3% losses here and there!!

    Let's try to find some baby-step (even if not clean -- we know FieldCache, somehow, needs to be fixed more generally) for today?

  • Yonik Seeley (JIRA) at Sep 17, 2010 at 6:02 pm
    [ https://issues.apache.org/jira/browse/LUCENE-2649?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12910699#action_12910699 ]

    Yonik Seeley commented on LUCENE-2649:
    --------------------------------------

    bq. Let's try to find some baby-step (even if not clean - we know FieldCache, somehow, needs to be fixed more generally) for today?

    The cheapest option might be:

    {quote}
    public interface Parser extends Serializable {
      public boolean recordMissing();
    }
    {quote}

    A better option is to replace FieldCache.Parser in SortField to be FieldCache.EntryCreator.

    Oh, and if we're recording all the set bits, it would be really nice to record both
    - the number of values set
    - the number of unique values encountered

    Both should be zero or non-measurable cost (a counter++ that does not produce a data dependency can be executed in parallel on a free int execution unit)

  • Yonik Seeley (JIRA) at Sep 17, 2010 at 6:02 pm
    [ https://issues.apache.org/jira/browse/LUCENE-2649?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12910699#action_12910699 ]

    Yonik Seeley edited comment on LUCENE-2649 at 9/17/10 2:01 PM:
    ---------------------------------------------------------------

    bq. Let's try to find some baby-step (even if not clean - we know FieldCache, somehow, needs to be fixed more generally) for today?

    The cheapest option might be:

    {code}
    public interface Parser extends Serializable {
      public boolean recordMissing();
    }
    {code}

    A better option is to replace FieldCache.Parser in SortField to be FieldCache.EntryCreator.

    Oh, and if we're recording all the set bits, it would be really nice to record both
    - the number of values set
    - the number of unique values encountered

    Both should be zero or non-measurable cost (a counter++ that does not produce a data dependency can be executed in parallel on a free int execution unit)
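
    A sketch of how a parser-level `recordMissing()` flag and the two counters might fit together while an entry is built. `IntParserSketch` and `IntEntrySketch` are hypothetical types, the one-term-per-doc input array is an assumption made to keep the example self-contained, and the unique count uses a `HashSet` for brevity where real uninversion could simply count terms as it enumerates them.

```java
import java.util.BitSet;
import java.util.HashSet;
import java.util.Set;

/** Hypothetical parser with the opt-in flag; modeled on the idea, not Lucene's FieldCache.Parser. */
interface IntParserSketch {
    int parseInt(String term);

    boolean recordMissing();
}

/** Hypothetical cache entry: values plus optional valid bits and two counters. */
final class IntEntrySketch {
    final int[] values;
    final BitSet valid; // null unless parser.recordMissing() returned true
    final int numValuesSet;
    final int numUniqueValues;

    private IntEntrySketch(int[] values, BitSet valid, int numValuesSet, int numUniqueValues) {
        this.values = values;
        this.valid = valid;
        this.numValuesSet = numValuesSet;
        this.numUniqueValues = numUniqueValues;
    }

    /** termPerDoc[doc] is the doc's single term, or null when the doc has no value (an assumption). */
    static IntEntrySketch build(String[] termPerDoc, IntParserSketch parser) {
        int[] values = new int[termPerDoc.length];
        BitSet valid = parser.recordMissing() ? new BitSet(termPerDoc.length) : null;
        Set<Integer> unique = new HashSet<>();
        int set = 0;
        for (int doc = 0; doc < termPerDoc.length; doc++) {
            if (termPerDoc[doc] == null) {
                continue; // values[doc] stays 0, the ambiguous default
            }
            values[doc] = parser.parseInt(termPerDoc[doc]);
            unique.add(values[doc]);
            set++;
            if (valid != null) {
                valid.set(doc);
            }
        }
        return new IntEntrySketch(values, valid, set, unique.size());
    }
}
```

    Because the flag travels with the parser, each caller decides per request whether the bits are worth the memory, and no static default is involved.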



  • Ryan McKinley (JIRA) at Sep 17, 2010 at 6:03 pm
    [ https://issues.apache.org/jira/browse/LUCENE-2649?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12910700#action_12910700 ]

    Ryan McKinley commented on LUCENE-2649:
    ---------------------------------------

    Are people generally ok with the idea of global on/off? I think that is a reasonable approach... I agree that we should avoid static fields to control behavior. But do we avoid it at the cost of not allowing the option, or waiting till we rework FieldCache?

    If the consensus is that FieldCache needs to be reworked *before* something like this could be added, that's fine... I'll move on to other things. Any relatively easy suggestions for how to enable the option without a global static? (Note that FieldCache is already a global static; at least FieldCache.DEFAULT is referenced a lot.)

    Perhaps this could/should live in /trunk until a cleaner solution is viable?




  • Uwe Schindler (JIRA) at Sep 17, 2010 at 6:12 pm
    [ https://issues.apache.org/jira/browse/LUCENE-2649?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12910707#action_12910707 ]

    Uwe Schindler commented on LUCENE-2649:
    ---------------------------------------

    I am against that option! No static defaults!

    bq. the number of values set

    This is OpenBitSet.cardinality()? I don't think we should add this extra cost during creation, as it can be retrieved quite easily if really needed.
  • Uwe Schindler (JIRA) at Sep 17, 2010 at 6:14 pm
    [ https://issues.apache.org/jira/browse/LUCENE-2649?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12910707#action_12910707 ]

    Uwe Schindler edited comment on LUCENE-2649 at 9/17/10 2:11 PM:
    ----------------------------------------------------------------

    I am against that option! No static defaults! (and if it *must* be there - default it to true on Lucene, too).

    bq. the number of values set

    This is OpenBitSet.cardinality()? I don't think we should add this extra cost during creation, as it can be retrieved quite easily if really needed.
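
    For illustration, deriving the count on demand could look like the sketch below. It uses `java.util.BitSet` as a stand-in for OpenBitSet, and `ValuesSetCount` is a hypothetical helper.

```java
import java.util.BitSet;

/** Hypothetical helper; java.util.BitSet stands in for Lucene's OpenBitSet. */
final class ValuesSetCount {
    /** Number of docs that have a value, computed only when somebody asks for it. */
    static int numValuesSet(BitSet valid) {
        return valid.cardinality();
    }
}
```

    This covers only "the number of values set"; the number of unique values cannot be recovered from the bits alone.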

  • Ryan McKinley (JIRA) at Sep 17, 2010 at 6:16 pm
    [ https://issues.apache.org/jira/browse/LUCENE-2649?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12910711#action_12910711 ]

    Ryan McKinley commented on LUCENE-2649:
    ---------------------------------------

    I like the idea of optionally caching the numdocs and unique values -- that would make sorting by this field faster -- the ArrayValues class could be easily augmented with this.

    The problem with augmenting the Parser class as you suggest is that we would have to rejigger everything that touches Parser. We would need different default classes for things that want or don't want the missing records. How do we handle this bit:
    {code:java}
    if (parser == null) {
      try {
        return wrapper.getIntValues(reader, field, DEFAULT_INT_PARSER);
      } catch (NumberFormatException ne) {
        return wrapper.getIntValues(reader, field, NUMERIC_UTILS_INT_PARSER);
      }
    }
    {code}
    yuck

  • Yonik Seeley (JIRA) at Sep 17, 2010 at 6:29 pm
    [ https://issues.apache.org/jira/browse/LUCENE-2649?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12910717#action_12910717 ]

    Yonik Seeley commented on LUCENE-2649:
    --------------------------------------

    If ya care - don't pass a null parser! Otherwise you get the default.

    bq. This is OpenBitSet.cardinality()

    Which isn't free... and calculating it over and over again is silly if you care about those numbers.

    bq. I don't think we should add this extra cost during creation,

    I don't think it will add extra cost. I could be wrong, but I don't think it will be measurable.
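As a rough illustration of the point about cost, the count can be gathered in the same loop that fills the array, so no later cardinality() pass is needed. This is a minimal sketch, not the patch itself: `java.util.BitSet` stands in for OpenBitSet, and the class `CountWhileFilling`, its `Filled` holder, and the `fill` method are hypothetical names.

```java
import java.util.BitSet;

// Hypothetical sketch: java.util.BitSet stands in for Lucene's OpenBitSet.
final class CountWhileFilling {

  /** Values, the valid-docs bitset, and a doc count gathered during the fill. */
  static final class Filled {
    final int[] values;
    final BitSet valid;
    final int numDocs;

    Filled(int[] values, BitSet valid, int numDocs) {
      this.values = values;
      this.valid = valid;
      this.numDocs = numDocs;
    }
  }

  /** docIds[i] carries value vals[i]; docs that are never listed have no value. */
  static Filled fill(int maxDoc, int[] docIds, int[] vals) {
    int[] values = new int[maxDoc];
    BitSet valid = new BitSet(maxDoc);
    int numDocs = 0;
    for (int i = 0; i < docIds.length; i++) {
      int doc = docIds[i];
      values[doc] = vals[i];
      if (!valid.get(doc)) {
        // First time this doc is seen: mark it and count it in the same pass.
        valid.set(doc);
        numDocs++;
      }
    }
    return new Filled(values, valid, numDocs);
  }
}
```

Counting only on the first sighting of a doc keeps `numDocs` equal to what `valid.cardinality()` would report, even if a doc shows up under more than one term.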
  • Ryan McKinley (JIRA) at Sep 17, 2010 at 6:52 pm
    [ https://issues.apache.org/jira/browse/LUCENE-2649?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12910728#action_12910728 ]

    Ryan McKinley commented on LUCENE-2649:
    ---------------------------------------

    bq. If ya care - don't pass a null parser! Otherwise you get the default.

    What if I care, but something else (that does not care) asks for the value first? It seems odd to have so much depend on *who* asks for the value first.

    bq. A better option is to replace FieldCache.Parser in SortField to be FieldCache.EntryCreator.

    How would that work? What if a filter creates the cache before the SortField?
  • Yonik Seeley (JIRA) at Sep 17, 2010 at 7:05 pm
    [ https://issues.apache.org/jira/browse/LUCENE-2649?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12910736#action_12910736 ]

    Yonik Seeley commented on LUCENE-2649:
    --------------------------------------

    bq. What if I care, but something else (that does not care) asks for the value first? Seems odd to have so much depend on who asks for the value first

    As long as it *can* be passed everywhere that matters, then it's up to the application - which knows if it ever needs the missing values or not for that field. For solr, we could make it configurable per-field... but I'd prob default it to ON to avoid unpredictable weirdness.

    bq. What if a filter creates the cache before the SortField?

    If we have a filter that uses the field cache, then it should also be able to specify the same stuff that SortField can.
  • Yonik Seeley (JIRA) at Sep 17, 2010 at 7:20 pm
    [ https://issues.apache.org/jira/browse/LUCENE-2649?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12910740#action_12910740 ]

    Yonik Seeley commented on LUCENE-2649:
    --------------------------------------

    bq. I agree that we should avoid static fields to control behavior. But do we avoid it at the cost of not allowing the option, or waiting till we rework FieldCache?

    I agree with this sentiment - progress, not perfection. Being able to turn it on or off for everything in the process is better than nothing at all.
  • Yonik Seeley (JIRA) at Sep 17, 2010 at 7:35 pm
    [ https://issues.apache.org/jira/browse/LUCENE-2649?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12910746#action_12910746 ]

    Yonik Seeley commented on LUCENE-2649:
    --------------------------------------

    bq. How would that work?

    We could start off simple - add only recordMissing functionality and punt on the rest, while still leaving a place to add it.

    {code}
    class FieldCache {

      // abstract, because getParser() has no default implementation
      public abstract static class EntryCreator {
        public boolean recordMissing() {
          return false;
        }

        public abstract Parser getParser();
      }
    }
    {code}

    Not even sure if a whole hierarchy is needed at this point... in the future, we'd prob want

    {code}
    public static class EntryCreatorInt extends EntryCreator {
      public IntValues getIntValues(IndexReader reader, String field) {... code currently in FieldCacheImpl that fills the fieldCache...}
      ...
    }
    {code}

  • Ryan McKinley (JIRA) at Sep 17, 2010 at 9:45 pm
    [ https://issues.apache.org/jira/browse/LUCENE-2649?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12910803#action_12910803 ]

    Ryan McKinley commented on LUCENE-2649:
    ---------------------------------------

    Maybe, but I'm still not sure this cleans things up enough to be worth the trouble -- ideally the API should make it easy to get consistent results. I don't like that it would be too easy to mess things up if the application does not use the same parser from various components (that may be in different libraries etc). Conceptually it makes sense to have settings about what is or is not cached attached to the FieldCache itself, not to the things that ask the FieldCache for its values, which would let whoever asks first set the behavior for the next guy who asks (regardless of what they ask for!).

    If we are going to make it essentially required to always pass in the right Parser/EntryCreator, we should at least remove all the ways of not passing one in -- since that call is saying "use what ever is there, and the next guy who asks should be ok with it too"

    Does something like the EntryCreator idea fix -- or at least begin to fix -- the other FieldCache issues? If not, is it really worth introducing just to avoid a static variable?

    I think the best near-term option is to live with the static initializer, and fix it when we rework the FieldCache to fix a host of other issues. For solr the default will be set to always calculate, for lucene... we will let Mike and Uwe duke it out :)
  • Uwe Schindler (JIRA) at Sep 17, 2010 at 9:53 pm
    [ https://issues.apache.org/jira/browse/LUCENE-2649?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12910805#action_12910805 ]

    Uwe Schindler commented on LUCENE-2649:
    ---------------------------------------

    Supporting different parsers is not an issue at all. You can call getBytes() with different parsers, you simply create two entries in the cache, as each parser produces a different cache instance. And getBytes() without parser is also fine, as then you get the default parser from the cache (which would not create a third instance!). - [Parser is part of the cache key]
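The behavior described here can be sketched with a toy cache in which the parser is part of the key and a missing parser resolves to the default before the lookup. This is only an illustration under assumptions: `ParserKeyedCache`, its `String[] termPerDoc` input (standing in for an IndexReader), and `entryCount()` are hypothetical and do not mirror FieldCacheImpl.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for a field cache; not Lucene's actual implementation.
final class ParserKeyedCache {

  interface ByteParser {
    byte parseByte(String term);
  }

  static final ByteParser DEFAULT_BYTE_PARSER = term -> Byte.parseByte(term);

  // field -> (parser -> cached array): the parser is part of the cache key.
  private final Map<String, Map<ByteParser, byte[]>> cache = new HashMap<>();

  /** No parser given: resolves to the default parser's entry. */
  byte[] getBytes(String field, String[] termPerDoc) {
    return getBytes(field, termPerDoc, null);
  }

  byte[] getBytes(String field, String[] termPerDoc, ByteParser parser) {
    // Resolve null to the default BEFORE the lookup, so no extra entry is created.
    ByteParser effective = (parser == null) ? DEFAULT_BYTE_PARSER : parser;
    Map<ByteParser, byte[]> perField = cache.computeIfAbsent(field, f -> new HashMap<>());
    byte[] values = perField.get(effective);
    if (values == null) {
      values = new byte[termPerDoc.length];
      for (int doc = 0; doc < termPerDoc.length; doc++) {
        if (termPerDoc[doc] != null) {
          values[doc] = effective.parseByte(termPerDoc[doc]);
        }
      }
      perField.put(effective, values);
    }
    return values;
  }

  /** Total number of cached arrays across all fields and parsers. */
  int entryCount() {
    int n = 0;
    for (Map<ByteParser, byte[]> perField : cache.values()) {
      n += perField.size();
    }
    return n;
  }
}
```

Calling `getBytes` once without a parser, once with `DEFAULT_BYTE_PARSER`, and once with a custom parser leaves two cached arrays for the field in this sketch: the first two calls share one entry.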
  • Ryan McKinley (JIRA) at Sep 19, 2010 at 12:22 am
    [ https://issues.apache.org/jira/browse/LUCENE-2649?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12911128#action_12911128 ]

    Ryan McKinley commented on LUCENE-2649:
    ---------------------------------------

    I thought of an optimization that could reduce memory usage...

    If all non-deleted documents have a value, we don't need a real BitSet -- just a Bits implementation that always returns true.

    That should save 3% (or 12.5%) here and there.
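For reference, the 3% and 12.5% figures follow from one extra bit per document against a 32-bit int or an 8-bit byte per document. The sketch below shows that arithmetic next to a constant "match all" implementation; `ValidBitsOverhead`, the nested `Bits` interface, and `overheadPercent` are hypothetical names for illustration, not the classes in the patch.

```java
// Hypothetical sketch; Bits here is a minimal stand-in for Lucene's interface.
final class ValidBitsOverhead {

  interface Bits {
    boolean get(int index);
    int length();
  }

  /** Constant-size replacement for a bitset in which every doc has a value. */
  static final class MatchAllBits implements Bits {
    private final int len;

    MatchAllBits(int len) {
      this.len = len;
    }

    public boolean get(int index) {
      return true;
    }

    public int length() {
      return len;
    }
  }

  /** Extra memory of a one-bit-per-doc bitset relative to the value array, in percent. */
  static double overheadPercent(int bitsPerValue) {
    return 100.0 / bitsPerValue;
  }

  public static void main(String[] args) {
    System.out.println(overheadPercent(Integer.SIZE)); // 3.125 (int values)
    System.out.println(overheadPercent(Byte.SIZE));    // 12.5  (byte values)
  }
}
```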

    - - - - - -

    One other thing to consider... do we want to remove the getXXXX functions that do not take a Parser? Passing in null is equivalent?
  • Ryan McKinley (JIRA) at Sep 19, 2010 at 3:45 am
    [ https://issues.apache.org/jira/browse/LUCENE-2649?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12911157#action_12911157 ]

    Ryan McKinley commented on LUCENE-2649:
    ---------------------------------------

    Here is the code for ByteValues that:
    # optionally stores the BitSet via static config
    # does not cache a real BitSet unless only some docs match
    # calculates numDocs/numTerms

    {code:java}

    @Override
    protected ByteValues createValue(IndexReader reader, Entry entryKey) throws IOException {
      Entry entry = entryKey;
      String field = entry.field;
      ByteParser parser = (ByteParser) entry.custom;
      if (parser == null) {
        return wrapper.getByteValues(reader, field, FieldCache.DEFAULT_BYTE_PARSER);
      }
      int numDocs = 0;
      int numTerms = 0;
      int maxDoc = reader.maxDoc();
      final byte[] retArray = new byte[maxDoc];
      Bits valid = null;
      Terms terms = MultiFields.getTerms(reader, field);
      if (terms != null) {
        final TermsEnum termsEnum = terms.iterator();
        final Bits delDocs = MultiFields.getDeletedDocs(reader);
        final OpenBitSet validBits = new OpenBitSet( maxDoc );
        DocsEnum docs = null;
        try {
          while(true) {
            final BytesRef term = termsEnum.next();
            if (term == null) {
              break;
            }
            final byte termval = parser.parseByte(term);
            docs = termsEnum.docs(delDocs, docs);
            while (true) {
              final int docID = docs.nextDoc();
              if (docID == DocsEnum.NO_MORE_DOCS) {
                break;
              }
              retArray[docID] = termval;
              validBits.set( docID );
              numDocs++;
            }
            numTerms++;
          }
        } catch (StopFillCacheException stop) {}

        // If all non-deleted docs are valid we don't need the bitset in memory
        if( numDocs > 0 && CachedArray.CACHE_VALID_ARRAY_BITS ) {
          boolean matchesAllDocs = true;
          for( int i=0; i<maxDoc; i++ ) {
            // delDocs can be null when the reader has no deletions, so guard the lookup
            if( (delDocs == null || !delDocs.get(i)) && !validBits.get(i) ) {
              matchesAllDocs = false;
              break;
            }
          }
          if( matchesAllDocs ) {
            valid = new Bits.MatchAllBits( maxDoc );
          }
          else {
            valid = validBits;
          }
        }
      }
      if( numDocs < 1 ) {
        valid = new Bits.MatchNoBits( maxDoc );
      }
      return new ByteValues( retArray, valid, numDocs, numTerms );
    }
    {code}
  • Ryan McKinley (JIRA) at Sep 22, 2010 at 2:02 am
    [ https://issues.apache.org/jira/browse/LUCENE-2649?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12913347#action_12913347 ]

    Ryan McKinley commented on LUCENE-2649:
    ---------------------------------------

    Any thoughts on this?

    I think the best move forward is to:
    a. optimize as much as possible
    b. drop the no-parser function option
    c. optionally store the bitset via static config (ugly, but lesser of many ugly options)
    d. set lucene default=false (actually I don't care)
    e. set solr default=true

    Unless there are objections, I will clean up the patch, fix javadoc, tests, etc
  • Uwe Schindler (JIRA) at Sep 22, 2010 at 2:15 am
    [ https://issues.apache.org/jira/browse/LUCENE-2649?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12913353#action_12913353 ]

    Uwe Schindler commented on LUCENE-2649:
    ---------------------------------------

    Also set the Lucene default to true, as I want to improve sorting and FCRF.
  • Ryan McKinley (JIRA) at Sep 22, 2010 at 6:37 am
    [ https://issues.apache.org/jira/browse/LUCENE-2649?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

    Ryan McKinley updated LUCENE-2649:
    ----------------------------------

    Attachment: LUCENE-2649-FieldCacheWithBitSet.patch

    Here is a (hopefully) final patch that adds a bunch of tests to exercise the 'valid' bits (and check that MatchAll is used when appropriate)
  • Uwe Schindler (JIRA) at Sep 22, 2010 at 7:01 am
    [ https://issues.apache.org/jira/browse/LUCENE-2649?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12913428#action_12913428 ]

    Uwe Schindler commented on LUCENE-2649:
    ---------------------------------------

    Hi Ryan,

    few comments:
    - the checks for whether all/no bits are set (so that the special Bits implementations are returned) are fine, but the special case where all bits are valid may be a bit inefficient to detect and rarely hit
    - please use the correct Java code style ("{" should be at the end of previous line and not in separate line for method declarations), the Eclipse code style is available in Wiki
  • Michael McCandless (JIRA) at Sep 22, 2010 at 10:20 am
    [ https://issues.apache.org/jira/browse/LUCENE-2649?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12913485#action_12913485 ]

    Michael McCandless commented on LUCENE-2649:
    --------------------------------------------

    bq. Also set the Lucene default to true

    Please don't!

    bq. as I want to improve sorting and FCRF.

    But: sorting, FCRF must continue to work if the app chooses not to
    load valid bits, right?

    Other feedback on current patch:

    * We don't have to @Deprecate for 4.0 -- just remove it, and note
    this in MIGRATE.txt. (Though for 3.x we need the deprecation, so
    maybe do 3.x patch first, then remove deprecations for 4.0?).

    * FieldCache.EntryCreator looks orphan'd?

    It looks like the valid bits will not reflect deletions (by design),
    right? Ie caller cannot rely on valid always incorporating deleted
    docs. (Eg the MatchAll opto disregards deletions, and, a reopened
    segment can have new deletions yet shares the FC entry).

    The static config still also bothers me... and, going that route means
    we must agree on a default (which is looking hard!).

    What if we:

    * Allow specifying "loadValidBits" on each load (eg via Parser or
    separate arg to FC.getXXValues), but,

    * We separately cache the valid bits (we'd still have the XXXValues
    returned, to include the valid bits & values).

    This way if an app "messes up", they do not end up double-storing the
    actual values, ie the worst that happens is they have to re-invert
    just to generate the valid bits. Even that should be fairly rare, ie,
    if they use MissingStringLastComparator it'll init both values & valid
    bits entries in the cache on the first go.
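One way to picture this proposal is a cache with two maps, one for the value arrays and one for the valid bits, with a per-call flag. The sketch below is only an illustration under assumptions: `SplitFieldCache`, the `Integer[] source` input (null meaning "no value", standing in for the inverted index), and the `inversions` counter are hypothetical, and `java.util.BitSet` stands in for the real bitset class.

```java
import java.util.BitSet;
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of caching values and valid bits as separate entries.
final class SplitFieldCache {

  /** What a caller gets back: the values plus valid bits, which may be null. */
  static final class IntValues {
    final int[] values;
    final BitSet valid;

    IntValues(int[] values, BitSet valid) {
      this.values = values;
      this.valid = valid;
    }
  }

  private final Map<String, int[]> valuesCache = new HashMap<>();
  private final Map<String, BitSet> validCache = new HashMap<>();
  int inversions = 0; // how many times the field had to be inverted

  IntValues getIntValues(String field, Integer[] source, boolean loadValidBits) {
    int[] values = valuesCache.get(field);
    BitSet valid = validCache.get(field);
    if (values == null || (loadValidBits && valid == null)) {
      inversions++;
      int[] freshValues = new int[source.length];
      BitSet freshValid = new BitSet(source.length);
      for (int doc = 0; doc < source.length; doc++) {
        if (source[doc] != null) {
          freshValues[doc] = source[doc];
          freshValid.set(doc);
        }
      }
      if (values == null) {
        // Only store the array if none is cached yet: values are never double-stored.
        values = freshValues;
        valuesCache.put(field, values);
      }
      if (loadValidBits) {
        valid = freshValid;
        validCache.put(field, valid);
      }
    }
    return new IntValues(values, valid);
  }
}
```

In this sketch, asking first without valid bits and later with them costs a second inversion, but the second call reuses the already cached value array instead of storing a duplicate.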

  • Robert Muir (JIRA) at Sep 22, 2010 at 11:30 am
    [ https://issues.apache.org/jira/browse/LUCENE-2649?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12913503#action_12913503 ]

    Robert Muir commented on LUCENE-2649:
    -------------------------------------

    bq. Also set the Lucene default to true, as I want to improve sorting and FCRF.

    bq. I know it's only 3% (for ints... 12.5% for bytes), but, 3% here, 3% there and suddenly we're talking real money.

    I'm having trouble understanding the use case for this bitset.

    The jira issue says to add a bitset, but doesn't explain why.

    The linked thread talks about this being useful for sorting missing values last, but I don't think this justifies
    increasing the size of fieldcache by default.
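For context on the "sorting missing values last" use case mentioned above, here is a minimal sketch of why the value array alone is not enough: a doc without a value holds the default 0, which is indistinguishable from a real 0 unless something like the valid bits is consulted. The class `MissingLastSort` and the method `missingLast` are hypothetical, and `java.util.BitSet` stands in for the cached valid bits.

```java
import java.util.BitSet;
import java.util.Comparator;

// Hypothetical sketch of a missing-last ordering over doc IDs.
final class MissingLastSort {

  /** Orders doc IDs by value, with docs that have no value placed at the end. */
  static Comparator<Integer> missingLast(int[] values, BitSet valid) {
    return (a, b) -> {
      boolean hasA = valid.get(a);
      boolean hasB = valid.get(b);
      if (hasA != hasB) {
        return hasA ? -1 : 1; // the doc that has a value sorts first
      }
      if (hasA) {
        int c = Integer.compare(values[a], values[b]);
        if (c != 0) {
          return c;
        }
      }
      return Integer.compare(a, b); // tie-break (or both missing): doc ID order
    };
  }
}
```

With values `{0, 5, 0, -3}` where only docs 1 and 3 actually have a value, this ordering yields docs 3, 1, 0, 2, whereas sorting on the raw array would place the two missing docs between -3 and 5.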

  • Ryan McKinley (JIRA) at Sep 22, 2010 at 3:38 pm
    [ https://issues.apache.org/jira/browse/LUCENE-2649?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

    Ryan McKinley updated LUCENE-2649:
    ----------------------------------

    Attachment: LUCENE-2649-FieldCacheWithBitSet.patch

    Here is a new patch that removes the static config. Rather than put a property on the Parser class, I added a class:
    {code:java}
    public abstract static class CacheConfig {
      public abstract boolean cacheValidBits();
    }
    {code}
    and this gets passed to the getXXXValues function:
    {code:java}
    ByteValues getByteValues(IndexReader reader, String field, ByteParser parser, CacheConfig config)
    {code}

    I think this is a better option than adding a parameter to Parser since we can have an easy upgrade path. Parser is an interface, so we cannot just add to it without breaking compatibility. To change things in 4.x, 3.x should have an upgrade path.

    I took Mike's suggestion and included the CacheConfig hashcode in the Cache key. However, I don't cache the Bits separately, since mixing configs is an edge case that *should* be avoided, but at least it does not fail if you are not consistent.
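
    A rough sketch of the cache-key idea, not the code in the patch: SimpleCacheConfig and CacheKey are hypothetical names, and the value-based hashCode on the config is an assumption made for illustration. Two lookups with the same field and parser but different configs produce different keys, so they get separate entries instead of failing.

```java
import java.util.Objects;

// Hypothetical concrete config; the patch itself only shows an abstract CacheConfig.
final class SimpleCacheConfig {
    private final boolean cacheValidBits;

    SimpleCacheConfig(boolean cacheValidBits) {
        this.cacheValidBits = cacheValidBits;
    }

    boolean cacheValidBits() {
        return cacheValidBits;
    }

    @Override
    public boolean equals(Object o) {
        return o instanceof SimpleCacheConfig
            && ((SimpleCacheConfig) o).cacheValidBits == cacheValidBits;
    }

    @Override
    public int hashCode() {
        // Value-based hash (an assumption), so equal configs share a cache entry.
        return Boolean.hashCode(cacheValidBits);
    }
}

// Hypothetical cache key: field + parser + the config's hash code.
final class CacheKey {
    private final String field;
    private final Object parser;
    private final int configHash;

    CacheKey(String field, Object parser, SimpleCacheConfig config) {
        this.field = field;
        this.parser = parser;
        this.configHash = config.hashCode();
    }

    @Override
    public boolean equals(Object o) {
        if (!(o instanceof CacheKey)) {
            return false;
        }
        CacheKey other = (CacheKey) o;
        return field.equals(other.field)
            && Objects.equals(parser, other.parser)
            && configHash == other.configHash;
    }

    @Override
    public int hashCode() {
        return Objects.hash(field, parser, configHash);
    }
}
```

    The cost of this design is that an inconsistent caller holds two copies of the values, one per config, which is why mixing configs should be avoided.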

    This does cache a MatchAllBits even when 'cacheValidBits' is false, since that is small (a small class with one int).
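
    To illustrate why the match-all case is cheap, here is a minimal sketch. The ValidBits interface, MatchAllValidBits, and ValidBitsFactory are hypothetical names defined locally so the example is self-contained; they are not the classes in the patch.

```java
import java.util.BitSet;

// Hypothetical stand-in for a "which docs have a value" view.
interface ValidBits {
    boolean get(int docId);
    int length();
}

// Every doc has a value, so only maxDoc (one int) needs to be stored.
final class MatchAllValidBits implements ValidBits {
    private final int maxDoc;

    MatchAllValidBits(int maxDoc) {
        this.maxDoc = maxDoc;
    }

    @Override
    public boolean get(int docId) {
        return true;
    }

    @Override
    public int length() {
        return maxDoc;
    }
}

final class ValidBitsFactory {
    // Assumes 'seen' only has bits set in [0, maxDoc).
    static ValidBits compact(BitSet seen, final int maxDoc) {
        if (seen.cardinality() == maxDoc) {
            // All docs matched: drop the bitset and keep the one-int version.
            return new MatchAllValidBits(maxDoc);
        }
        final BitSet copy = (BitSet) seen.clone();
        return new ValidBits() {
            @Override
            public boolean get(int docId) {
                return copy.get(docId);
            }

            @Override
            public int length() {
                return maxDoc;
            }
        };
    }
}
```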

    -----------

    bq. * We don't have to @Deprecate for 4.0 - just remove it, and note this in MIGRATE.txt. (Though for 3.x we need the deprecation, so maybe do 3.x patch first, then remove deprecations for 4.0?).

    My plan was to apply with deprecations to 4.x, then merge with 3.x. Then replace the calls in 4.x, then remove the old functions. Does this sound reasonable?

    I would like this to get in 3.x since we could then remove many solr types in 4.x and have a 3.x migration path.

    bq. * FieldCache.EntryCreator looks orphan'd?

    dooh, thanks


    bq. It looks like the valid bits will not reflect deletions (by design), right? Ie caller cannot rely on valid always incorporating deleted docs. (Eg the MatchAll opto disregards deletions, and, a reopened segment can have new deletions yet shares the FC entry).

    Right, the ValidBits are only checked for docs that exist (and the FC values are only set for docs that exist -- this has not changed), and may contain false positives for deleted docs. I think this is OK since most use cases (that I can think of) deal with deletions anyway. Any ideas how/if we should change this? (I did not realize that the FC is reused after deletions -- so clever)
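
    In other words, a caller that cares about deletions would have to mask the valid bits itself. A minimal sketch, assuming both sets are available as java.util.BitSet (LiveValueCheck and hasLiveValue are hypothetical names, not part of the patch):

```java
import java.util.BitSet;

final class LiveValueCheck {
    // 'valid' may still report true for a doc deleted after the cache entry
    // was built, so the deletion check has to come from the caller.
    static boolean hasLiveValue(int docId, BitSet valid, BitSet deleted) {
        return !deleted.get(docId) && valid.get(docId);
    }
}
```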

    ----------------

    bq. I'm having trouble understanding the use case for this bitset.

    My motivation is supporting the sortMissingLast feature in Solr sorting (which could now be pushed to Lucene). For example, if I have a bunch of documents and only some have the field "bytes", sorting 'bytes desc' works great, but sorting 'bytes asc' puts all the documents that do not have the field 'bytes' first, since the FieldCache thinks they are all zero.
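
    A small sketch of how the valid bits would let an ascending sort push missing docs to the end. This is an illustration only, using a plain byte[] and java.util.BitSet rather than the FieldCache API; MissingLastSort is a hypothetical name.

```java
import java.util.Arrays;
import java.util.BitSet;

final class MissingLastSort {
    // Returns doc ids ordered by value ascending; docs without a value go last.
    static Integer[] sortAscendingMissingLast(final byte[] values, final BitSet valid) {
        Integer[] docs = new Integer[values.length];
        for (int i = 0; i < docs.length; i++) {
            docs[i] = i;
        }
        Arrays.sort(docs, (a, b) -> {
            int x = a;
            int y = b;
            boolean hasX = valid.get(x);
            boolean hasY = valid.get(y);
            if (hasX != hasY) {
                return hasX ? -1 : 1; // the doc with a value sorts first
            }
            if (!hasX) {
                return Integer.compare(x, y); // both missing: keep doc order
            }
            int c = Byte.compare(values[x], values[y]);
            return c != 0 ? c : Integer.compare(x, y); // tie-break on doc id
        });
        return docs;
    }
}
```

    Without the bitset, a doc with a real value of 0 and a doc with no value at all are indistinguishable, which is exactly the 'bytes asc' problem described above.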

    If we get this working in solr, we can deprecate and delete all the "sortable" number fields and have that same functionality on Trie* fields.






