DeletedDocs implement DocIdSet
------------------------------

Key: LUCENE-1476
URL: https://issues.apache.org/jira/browse/LUCENE-1476
Project: Lucene - Java
Issue Type: Improvement
Components: Index
Affects Versions: 2.4
Reporter: Jason Rutherglen
Priority: Trivial


DeletedDocs can implement DocIdSet. Then it can be exposed and replaced easily.

Search Discussions

  • Jason Rutherglen (JIRA) at Dec 3, 2008 at 10:30 pm
    [ https://issues.apache.org/jira/browse/LUCENE-1476?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

    Jason Rutherglen updated LUCENE-1476:
    -------------------------------------

    Description: SegmentReader.DeletedDocs can implement DocIdSet. Then it can be exposed and replaced easily. (was: DeletedDocs can implement DocIdSet. Then it can be exposed and replaced easily.)
    Summary: SegmentReader.DeletedDocs implement DocIdSet (was: DeletedDocs implement DocIdSet)
  • Jason Rutherglen (JIRA) at Dec 3, 2008 at 10:44 pm
    [ https://issues.apache.org/jira/browse/LUCENE-1476?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

    Jason Rutherglen updated LUCENE-1476:
    -------------------------------------

    Description: BitVector can implement DocIdSet. This is for making SegmentReader.deletedDocs pluggable. (was: SegmentReader.DeletedDocs can implement DocIdSet. Then it can be exposed and replaced easily.)
    Remaining Estimate: 12h
    Original Estimate: 12h
    Summary: BitVector implement DocIdSet (was: SegmentReader.DeletedDocs implement DocIdSet)
  • Jason Rutherglen (JIRA) at Dec 4, 2008 at 12:02 am
    [ https://issues.apache.org/jira/browse/LUCENE-1476?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

    Jason Rutherglen updated LUCENE-1476:
    -------------------------------------

    Attachment: LUCENE-1476.patch

    LUCENE-1476.patch

    BitVector extends DocIdSet.

    TestBitVector implements a testDocIdSet method that is based on the TestSortedVIntList tests.
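
    For readers following along, a minimal sketch (not the attached patch) of what exposing BitVector as a DocIdSet might look like, assuming the 2.4-era DocIdSetIterator contract (doc()/next()/skipTo()) and BitVector's existing get(int)/size() methods:

    {code}
    // Hypothetical sketch only -- not the contents of LUCENE-1476.patch.
    public class BitVector extends DocIdSet {
        // ... existing byte-packed bits, get(int), size(), etc. ...

        public DocIdSetIterator iterator() {
            return new DocIdSetIterator() {
                private int doc = -1;

                public int doc() { return doc; }

                public boolean next() {
                    while (++doc < size()) {
                        if (get(doc)) return true;   // next set bit = next deleted doc
                    }
                    return false;
                }

                public boolean skipTo(int target) {
                    doc = target - 1;
                    return next();
                }
            };
        }
    }
    {code}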
  • Michael McCandless (JIRA) at Dec 4, 2008 at 12:02 am
    [ https://issues.apache.org/jira/browse/LUCENE-1476?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12653070#action_12653070 ]

    Michael McCandless commented on LUCENE-1476:
    --------------------------------------------

    But, SegmentReader needs random access to the bits (DocIdSet only provides an iterator)?
  • Jason Rutherglen (JIRA) at Dec 4, 2008 at 12:10 am
    [ https://issues.apache.org/jira/browse/LUCENE-1476?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12653075#action_12653075 ]

    Jason Rutherglen commented on LUCENE-1476:
    ------------------------------------------

    Looks like we need a new abstract class. RABitSet?
  • robert engels (JIRA) at Dec 4, 2008 at 12:12 am
    [ https://issues.apache.org/jira/browse/LUCENE-1476?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12653076#action_12653076 ]

    robert engels commented on LUCENE-1476:
    ---------------------------------------

    BitSet is already random access, DocIdSet is not.
  • Jason Rutherglen (JIRA) at Dec 4, 2008 at 12:18 am
    [ https://issues.apache.org/jira/browse/LUCENE-1476?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12653080#action_12653080 ]

    Jason Rutherglen commented on LUCENE-1476:
    ------------------------------------------

    BitVector does not implement the methods of java.util.BitSet. RABitSet could be implemented by OpenBitSet and BitVector. This way an OpenBitSet or another filter such as P4Delta could be used in place of BitVector in SegmentReader.

    The IndexReader.flush type of methods would need to either automatically not save or throw an exception, and there would need to be a setting to choose between the two. This helps with the synchronization issue in SegmentReader.isDeleted by allowing direct access to the underlying set.
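
    A rough sketch of the abstraction being discussed (RABitSet is the name suggested above; the method names here are only illustrative):

    {code}
    // Illustrative sketch only. The point: DocIdSet gives iteration, this adds the
    // random access that SegmentReader.isDeleted() currently needs.
    public abstract class RABitSet extends DocIdSet {
        /** Random-access test, e.g. "is doc n deleted?". */
        public abstract boolean get(int index);

        /** Number of documents covered by this set. */
        public abstract int size();
    }

    // BitVector and OpenBitSet could both extend it, so an OpenBitSet (or another
    // implementation such as one based on P4Delta) could replace BitVector as
    // SegmentReader.deletedDocs.
    {code}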
  • Michael McCandless (JIRA) at Dec 5, 2008 at 11:32 am
    [ https://issues.apache.org/jira/browse/LUCENE-1476?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12653752#action_12653752 ]

    Michael McCandless commented on LUCENE-1476:
    --------------------------------------------

    bq. But, SegmentReader needs random access to the bits (DocIdSet only provides an iterator)?

    Although IndexReader.isDeleted exposes a random-access API to deleted docs, I think it may be overkill.

    Ie, in most (all?) uses of deleted docs throughout Lucene core/contrib, a simple iterator (DocIdSet) would in fact suffice.

    EG in SegmentTermDocs iteration we are always checking deletedDocs by ascending docID. It might be a performance gain (pure speculation) if we used an iterator API, because we could hold "nextDelDocID" and only advance that (skipTo) when the term's docID has moved past it. It's just like an "AND NOT X" clause.
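
    A toy illustration of that "AND NOT X" pattern on plain sorted arrays (purely to show the shape of the loop; not the actual SegmentTermDocs code):

    {code}
    // Toy sketch: walk the postings in ascending docID order and only advance the
    // deletions cursor (the skipTo) once the postings have moved past it.
    static void collectLiveDocs(int[] postings, int[] deletedDocs) { // both sorted
        int d = 0;
        int nextDelDocID = d < deletedDocs.length ? deletedDocs[d] : Integer.MAX_VALUE;
        for (int doc : postings) {
            while (nextDelDocID < doc) {   // advance deletions only when passed
                nextDelDocID = ++d < deletedDocs.length ? deletedDocs[d] : Integer.MAX_VALUE;
            }
            if (doc != nextDelDocID) {
                System.out.println("live doc " + doc);   // would be scored/collected
            }
        }
    }
    {code}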

    Similarly, norms, which also now expose a random-access API, should be fine with an iterator type API as well.

    This may also imply better VM behavior, since we don't actually require norms/deletions to be fully memory resident.

    This would be a biggish change, and it's not clear whether/when we should explore it, but I wanted to get the idea out there.

    Marvin, in KS/Lucy are you using random-access or iterator to access deletedDocs & norms?
  • robert engels (JIRA) at Dec 5, 2008 at 2:22 pm
    [ https://issues.apache.org/jira/browse/LUCENE-1476?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12653793#action_12653793 ]

    robert engels commented on LUCENE-1476:
    ---------------------------------------

    I don't think you can change this...

    In many cases after you have read an index, and retrieved document numbers, these are lazily returned to the client.

    By the time some records are needed to be read, they may have already been deleted (at least this was the usage in old lucene, where deletions happened in the reader).

    I think a lot of code assumes this, and calls the isDeleted() to ensure the document is still valid.
  • Marvin Humphrey (JIRA) at Dec 5, 2008 at 6:23 pm
    [ https://issues.apache.org/jira/browse/LUCENE-1476?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12653883#action_12653883 ]

    Marvin Humphrey commented on LUCENE-1476:
    -----------------------------------------
    {quote}
    Marvin, in KS/Lucy are you using random-access or iterator to access deletedDocs & norms?
    {quote}

    Both. There's a DelEnum class which is used by NOTScorer and MatchAllScorer, but it's implemented using BitVectors which get the next deleted doc num by calling nextSetBit() internally.

    I happened to be coding up those classes this spring when there was the big brouhaha about IndexReader.isDeleted(). It seemed wrong to pay the method call overhead for IndexReader.isDeleted() on each iter in NOTScorer.next() or MatchAllScorer.next(), when we could just store the next deletion:

    {code}
    i32_t
    MatchAllScorer_next(MatchAllScorer* self)
    {
        do {
            if (++self->doc_num > self->max_docs) {
                self->doc_num--;
                return 0;
            }
            if (self->doc_num > self->next_deletion) {
                self->next_deletion
                    = DelEnum_Skip_To(self->del_enum, self->doc_num);
            }
        } while (self->doc_num == self->next_deletion);
        return self->doc_num;
    }
    {code}

    (Note: Scorer.next() in KS returns the document number; doc nums start at 1, and 0 is the sentinel signaling iterator termination. I expect that Lucy will be the same.)

    Perhaps we could get away without needing the random access, but that's because IndexReader.isDeleted() isn't exposed and because IndexReader.fetchDoc(int docNum) returns the doc even if it's deleted -- unlike Lucene which throws an exception. Also, you can't delete documents against an IndexReader, so Robert's objection doesn't apply to us.

    I had always assumed we were going to have to expose isDeleted() eventually, but maybe we can get away with zapping it. Interesting!

    I've actually been trying to figure out a new design for deletions because writing them out for big segments is our last big write bottleneck, now that we've theoretically solved the sort cache warming issue. I figured we would continue to need bit-vector files because they're straightforward to mmap, but if we only need iterator access, we can use vbyte encoding instead... Hmm, we still face the problem of outsized write cost when a segment has a large number of deletions and you add one more...
  • Michael McCandless (JIRA) at Dec 5, 2008 at 6:45 pm
    [ https://issues.apache.org/jira/browse/LUCENE-1476?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12653889#action_12653889 ]

    Michael McCandless commented on LUCENE-1476:
    --------------------------------------------


    bq. It seemed wrong to pay the method call overhead for IndexReader.isDeleted() on each iter in NOTScorer.next() or MatchAllScorer.next(), when we could just store the next deletion:

    Nice! This is what I had in mind.

    I think we could [almost] do this across the board for Lucene.
    SegmentTermDocs would similarly store nextDeleted and apply the same
    "AND NOT" logic.

    bq. that's because IndexReader.isDeleted() isn't exposed and because IndexReader.fetchDoc(int docNum) returns the doc even if it's deleted

    Hmm -- that is very nicely enabling.

    bq. I've actually been trying to figure out a new design for deletions because writing them out for big segments is our last big write bottleneck

    One approach would be to use a "segmented" model. IE, if a few
    deletions are added, write that to a new "deletes segment", ie a
    single "normal segment" would then have multiple deletion files
    associated with it. These would have to be merged (iterator) when
    used during searching, and, periodically coalesced.

    bq. if we only need iterator access, we can use vbyte encoding instead

    Right: if there are relatively few deletes against a segment, encoding
    the "on bits" directly (or deltas) should be a decent win since
    iteration is much faster.

  • Michael McCandless (JIRA) at Dec 5, 2008 at 6:49 pm
    [ https://issues.apache.org/jira/browse/LUCENE-1476?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12653891#action_12653891 ]

    Michael McCandless commented on LUCENE-1476:
    --------------------------------------------


    {quote}
    In many cases after you have read an index, and retrieved document numbers, these are lazily returned to the client.

    By the time some records are needed to be read, they may have already been deleted (at least this was the usage in old lucene, where deletions happened in the reader).

    I think a lot of code assumes this, and calls the isDeleted() to ensure the document is still valid.
    {quote}

    But isn't that an uncommon use case? It's dangerous to wait a long
    time after getting a docID from a reader, before looking up the
    document. Most apps pull the doc right away, send it to the user, and
    the docID isn't kept (I think?).

    But still I agree: we can't eliminate random access to isDeleted
    entirely. We'd still have to offer it for such external cases.

    I'm just saying the internal uses of isDeleted could all be switched
    to iteration instead, and, we might get some performance gains from
    it especially when the number of deletes on a segment is relatively low.

  • robert engels (JIRA) at Dec 5, 2008 at 7:57 pm
    [ https://issues.apache.org/jira/browse/LUCENE-1476?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12653912#action_12653912 ]

    robert engels commented on LUCENE-1476:
    ---------------------------------------

    but IndexReader.document(n) throws an exception if the document is deleted... so you still need random access
  • Michael McCandless (JIRA) at Dec 5, 2008 at 8:03 pm
    [ https://issues.apache.org/jira/browse/LUCENE-1476?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12653915#action_12653915 ]

    Michael McCandless commented on LUCENE-1476:
    --------------------------------------------

    bq. but IndexReader.document throws an exception if the document is deleted... so you still need random access

    Does it really need to throw an exception? (Of course for back compat it does, but we could move away from that to a new method that doesn't check).
  • Jason Rutherglen (JIRA) at Dec 5, 2008 at 8:53 pm
    [ https://issues.apache.org/jira/browse/LUCENE-1476?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12653923#action_12653923 ]

    Jason Rutherglen commented on LUCENE-1476:
    ------------------------------------------

    It would be great if instead of relying on Lucene to manage the deletedDocs file, the API would be pluggable enough that a DocIdBitSet (a DocIdSet with random access) could be set on a SegmentReader, with the file access (reading and writing) managed from outside. Of course this is difficult to do while keeping things backwards compatible; however, for 3.0 I would *really* like to be a part of efforts to create a completely generic and pluggable API that is cleanly separated from the underlying index format and files. This would mean the analyzing, querying, and scoring portions of Lucene could access an IndexReader-like pluggable class, completely separated from the underlying index files and from when and how those files are written to disk.
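
    Purely as a hypothetical illustration of that kind of pluggability (nothing like this exists in 2.4; all names here are invented):

    {code}
    // Hypothetical API shape only -- invented for illustration.
    public interface DeletedDocsProvider {
        /** Supplies the deletions for a segment; the reader no longer reads the .del file. */
        DocIdSet deletedDocs(String segmentName) throws IOException;

        /** Called on flush/commit instead of Lucene writing the .del file itself. */
        void commitDeletions(String segmentName, DocIdSet deletions) throws IOException;
    }
    {code}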

    One motivation for this patch is to allow custom queries access to the deletedDocs in a clean way (meaning not needing to be a part of the o.a.l.s. package)

    I am wondering if it is good to try to get IndexReader.clone working again, or if there is some other better way related to this patch to externally manage the deletedDocs.

    Improving the performance of deletedDocs would help for every query so it's worth looking at.
  • Marvin Humphrey (JIRA) at Dec 5, 2008 at 9:56 pm
    [ https://issues.apache.org/jira/browse/LUCENE-1476?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12653939#action_12653939 ]

    Marvin Humphrey commented on LUCENE-1476:
    -----------------------------------------
    bq. Does it really need to throw an exception?

    Aside from back compat, I don't see why it would need to. I think the only rationale is to serve as a backstop protecting against invalid reads.
  • robert engels (JIRA) at Dec 5, 2008 at 10:46 pm
    [ https://issues.apache.org/jira/browse/LUCENE-1476?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12653954#action_12653954 ]

    robert engels commented on LUCENE-1476:
    ---------------------------------------

    That's my point: in complex multi-threaded software with multiple readers, etc., it is a good backstop against errors... :)
  • Marvin Humphrey (JIRA) at Dec 5, 2008 at 10:48 pm
    [ https://issues.apache.org/jira/browse/LUCENE-1476?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12653959#action_12653959 ]

    Marvin Humphrey commented on LUCENE-1476:
    -----------------------------------------
    bq. It would be great if instead of relying on Lucene to manage the deletedDocs file, the API would be pluggable

    In LUCENE-1478, "IndexComponent" was proposed, with potential subclasses including PostingsComponent, LexiconComponent/TermDictComponent, TermVectorsComponent, and so on. Since then, it has become apparent that SnapshotComponent and DeletionsComponent also belong at the top level.

    In Lucy/KS, these would all be specified within a Schema:

    {code}
    class MySchema extends Schema {
        DeletionsComponent deletionsComponent() {
            return new DocIdBitSetDeletionsComponent();
        }

        void initFields() {
            addField("title", "text");
            addField("content", "text");
        }

        Analyzer analyzer() {
            return new PolyAnalyzer("en");
        }
    }
    {code}

    Mike, you were planning on managing IndexComponents via IndexReader and IndexWriter constructor args, but won't that get unwieldy if there are too many components? A Schema class allows you to group them together. You don't have to use it to manage fields the way KS does -- just leave that out.
  • Michael McCandless (JIRA) at Dec 6, 2008 at 10:20 am
    [ https://issues.apache.org/jira/browse/LUCENE-1476?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12654047#action_12654047 ]

    Michael McCandless commented on LUCENE-1476:
    --------------------------------------------

    bq. Mike, you were planning on managing IndexComponents via IndexReader and IndexWriter constructor args, but won't that get unwieldy if there are too many components? A Schema class allows you to group them together. You don't have to use it to manage fields the way KS does - just leave that out.

    Agreed. I'll try to do something along these lines under LUCENE-1458.
  • Marvin Humphrey (JIRA) at Dec 8, 2008 at 12:58 am
    [ https://issues.apache.org/jira/browse/LUCENE-1476?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12654269#action_12654269 ]

    Marvin Humphrey commented on LUCENE-1476:
    -----------------------------------------
    bq. One approach would be to use a "segmented" model.

    That would improve the average performance of deleting a document, at the cost
    of some added complexity. Worst-case performance -- which you'd hit when you
    consolidated those sub-segment deletions files -- would actually degrade a
    bit.

    To manage consolidation, you'd need a deletions merge policy that operated
    independently from the primary merge policy. Aside from the complexity penalty,
    having two un-coordinated merge policies would be bad for real-time search,
    because you want to be able to control exactly when you pay for a big merge.

    I'm also bothered by the proliferation of small deletions files. Probably
    you'd want automatic consolidation of files under 4k, but you still could end
    up with a lot of files in a big index.

    So... what if we wrote, merged, and removed deletions files on the same
    schedule as ordinary segment files? Instead of going back and quasi-modifying
    an existing segment by associating a next-generation .del file with it, we write
    deletions to a NEW segment and have them reference older segments.

    In other words, we add "tombstones" rather than "delete" documents.

    Logically speaking, each tombstone segment file would consist of an array of
    segment identifiers, each of which would point to a "tombstone row" array of
    vbyte-encoded doc nums:

    {code}
    // _6.tombstone
    _2: [3, 4, 25]
    _3: [13]

    // _7.tombstone
    _2: [5]

    // _8.tombstone
    _1: [94]
    _2: [7, 8]
    _5: [54, 55]
    {code}

    The thing that makes this possible is that the dead docs marked by tombstones
    never get their doc nums shuffled during segment merging -- they just
    disappear. If deleted docs lived to be consolidated into new segments and
    acquire new doc nums, tombstones wouldn't work. However, we can associate
    tombstone rows with segment names and they only need remain valid as long
    as the segments they reference survive.

    Some tombstone rows will become obsolete once the segments they reference go
    away, but we never arrive at a scenario where we are forced to discard valid
    tombstones. Merging tombstone files simply involves dropping obsolete
    tombstone rows and collating valid ones.

    At search time, we'd use an iterator with an internal priority queue to
    collate tombstone rows into a stream -- so there's still no need to slurp the
    files at IndexReader startup.
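
    A hedged Java sketch of that collation step, merging the sorted tombstone rows held against one segment into a single ascending stream with a priority queue (illustrative only; Lucy/KS would of course do this in C):

    {code}
    import java.util.Comparator;
    import java.util.PriorityQueue;

    // Illustrative only: collate several sorted tombstone rows (each an int[] of
    // deleted doc nums that some tombstone segment holds against one target
    // segment) into a single ascending stream.
    public class TombstoneMerger {
        private final int[][] rows;
        private final PriorityQueue<int[]> pq =   // entries: {docNum, rowIndex, posInRow}
            new PriorityQueue<int[]>(11, new Comparator<int[]>() {
                public int compare(int[] a, int[] b) { return a[0] - b[0]; }
            });

        public TombstoneMerger(int[][] sortedRows) {
            this.rows = sortedRows;
            for (int r = 0; r < rows.length; r++) {
                if (rows[r].length > 0) pq.add(new int[] { rows[r][0], r, 0 });
            }
        }

        /** Next deleted doc num, or -1 when exhausted (duplicates are not collapsed). */
        public int next() {
            if (pq.isEmpty()) return -1;
            int[] top = pq.poll();
            int r = top[1], pos = top[2] + 1;
            if (pos < rows[r].length) pq.add(new int[] { rows[r][pos], r, pos });
            return top[0];
        }
    }
    {code}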
  • Michael McCandless (JIRA) at Dec 8, 2008 at 9:07 pm
    [ https://issues.apache.org/jira/browse/LUCENE-1476?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12654571#action_12654571 ]

    Michael McCandless commented on LUCENE-1476:
    --------------------------------------------


    I like this approach!!

    It's also incremental in cost (cost of flush/commit is in proportion
    to how many deletes were done), but you are storing the "packet" of
    incremental deletes with the segment you just flushed and not against
    the N segments that had deletes. And you write only one file to hold
    all the tombstones, which for commit() (file sync) is much less cost.

    And it's great that we don't need a new merge policy to handle all the
    delete files.

    Though one possible downside is, for a very large segment in a very
    large index you will likely be merging (at search time) quite a few
    delete packets. But, with the cutover to
    deletes-accessed-only-by-iterator, this cost is probably not high
    until a large pctg of the segment's docs are deleted, at which point
    you should really expungeDeletes() or optimize() or optimize(int)
    anyway.

    If only we could write code as quickly as we can dream...

  • Jason Rutherglen (JIRA) at Dec 8, 2008 at 10:00 pm
    [ https://issues.apache.org/jira/browse/LUCENE-1476?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12654592#action_12654592 ]

    Jason Rutherglen commented on LUCENE-1476:
    ------------------------------------------

    Marvin:
    "I'm also bothered by the proliferation of small deletions files. Probably
    you'd want automatic consolidation of files under 4k, but you still could end
    up with a lot of files in a big index."

    A transaction log might be better here if we want to get to 0ish-millisecond realtime.
    On Windows at least, rapidly creating and deleting files creates significant IO overhead.
    UNIX is probably faster, but I do not know.


  • Jason Rutherglen (JIRA) at Dec 8, 2008 at 10:08 pm
    [ https://issues.apache.org/jira/browse/LUCENE-1476?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12654595#action_12654595 ]

    Jason Rutherglen commented on LUCENE-1476:
    ------------------------------------------

    Wouldn't it be good to remove BitVector and replace it with OpenBitSet? OBS is faster and already has a DocIdSetIterator. It just needs to implement write-to-disk compression of the bitset (dgaps?). This would be a big win for almost *all* searches. We could also create an interface so that any bitset implementation could be used.

    Such as:
    {code}
    public interface WriteableBitSet {
        public void write(IndexOutput output) throws IOException;
    }
    {code}
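
    To make that concrete, a hedged sketch of the dgaps idea for the write() side, implementing the interface above over a plain word-packed long[] rather than the real OpenBitSet internals:

    {code}
    // Illustrative only: write the set bits as vint-encoded gaps ("dgaps"), which
    // stays small on disk when deletions are sparse. Assumes a word-packed long[].
    public class DGapBitSetWriter implements WriteableBitSet {
        private final long[] bits;    // bit i set = doc i deleted
        private final int numBits;

        public DGapBitSetWriter(long[] bits, int numBits) {
            this.bits = bits;
            this.numBits = numBits;
        }

        private boolean get(int i) {
            return (bits[i >> 6] & (1L << (i & 63))) != 0;
        }

        public void write(IndexOutput output) throws IOException {
            int count = 0;
            for (int i = 0; i < numBits; i++) {
                if (get(i)) count++;
            }
            output.writeVInt(numBits);
            output.writeVInt(count);
            int prev = 0;
            for (int i = 0; i < numBits; i++) {
                if (get(i)) {
                    output.writeVInt(i - prev);   // gap since the previous set bit
                    prev = i;
                }
            }
        }
    }
    {code}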
  • Marvin Humphrey (JIRA) at Jan 8, 2009 at 12:01 am
    [ https://issues.apache.org/jira/browse/LUCENE-1476?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12661786#action_12661786 ]

    Marvin Humphrey commented on LUCENE-1476:
    -----------------------------------------

    While converting over PostingList (the Lucy/KS analogue to TermDocs) to use a deletions iterator, it occurred to me that because the iterator has to keep state, a unique DelEnum object has to be created for every PostingList. In contrast, a BitVector object, which is accessed only via get(), can be shared.

    It bugs me that each PostingList will have its own DelEnum performing interleaving of tombstones. With very large queries against indexes with large numbers of deletions, it seems like a lot of duplicated work.

    To minimize CPU cycles, it would theoretically make more sense to handle deletions much higher up, at the top level Scorer, Searcher, or even the HitCollector level. PostingList would be completely ignorant of deletions, as would classes like NOTScorer and MatchAllScorer:

    {code}
    i32_t
    MatchAllScorer_next(MatchAllScorer* self)
    {
        if (++self->doc_num > self->max_docs) {
            self->doc_num--;
            return 0;
        }
        return self->doc_num;
    }
    {code}

    {code}
    void
    Scorer_collect(Scorer *self, HitCollector *collector, DelEnum *del_enum)
    {
        i32_t next_deletion = del_enum ? 0 : I32_MAX;
        i32_t doc_num       = 1;
        while (1) {
            while (doc_num >= next_deletion) {
                next_deletion = DelEnum_Skip_To(del_enum, doc_num);
                while (doc_num == next_deletion) {
                    doc_num++;
                    next_deletion = DelEnum_Skip_To(del_enum, doc_num);
                }
            }
            doc_num = Scorer_Skip_To(self, doc_num);
            if (doc_num) {
                HC_Collect(collector, doc_num, Scorer_Tally(self));
            }
            else {
                break;
            }
        }
    }
    {code}

    The problem is that PostingLists spun off from indexReader.postingList(field, term) would include deleted documents, as would Scorers.

    I suppose you could band-aid indexReader.postingList() by adding a boolean suppressDeletions argument which would default to true, but that seems messy.

    Jason, I think the inefficiency of needing individual iterator objects applies to OpenBitSet as well, right? As soon as we do away with random access, we have to keep state. Dunno if it's going to be noticeable, but it's conceptually annoying.
  • Jason Rutherglen (JIRA) at Jan 8, 2009 at 1:47 am
    [ https://issues.apache.org/jira/browse/LUCENE-1476?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12661814#action_12661814 ]

    Jason Rutherglen commented on LUCENE-1476:
    ------------------------------------------

    I like this idea of tombstones and we should figure out a way to support it. This issue https://issues.apache.org/jira/browse/LUCENE-584 had an implementation of a "matcher" class that the scorers implemented which I do not think made it into the committed patch.
    bq. I think the inefficiency of needing individual iterator objects applies to OpenBitSet as well, right?

    Yes that is true.

    For realtime search where a new transaction may only have a handful of deletes the tombstones may not be optimal because too many tombstones would accumulate (I believe). For this scenario rolling bitsets may be better. Meaning pool bit sets and throw away unused readers.
  • Marvin Humphrey (JIRA) at Jan 8, 2009 at 2:39 am
    [ https://issues.apache.org/jira/browse/LUCENE-1476?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12661829#action_12661829 ]

    Marvin Humphrey commented on LUCENE-1476:
    -----------------------------------------

    Jason Rutherglen:
    bq. For realtime search where a new transaction may only have a handful of deletes the tombstones may not be optimal

    The whole tombstone idea arose out of the need for (close to) realtime search! It's intended to improve write speed.

    When you make deletes with the BitSet model, you have to rewrite files that scale with segment size, regardless of how few deletions you make. Deletion of a single document in a large segment may necessitate writing out a substantial bit vector file.

    In contrast, i/o throughput for writing out a tombstone file scales with the number of tombstones.
    bq. because too many tombstones would accumulate (I believe).

    Say that you make a string of commits that are nothing but deleting a single document -- thus adding a new segment each time that contains nothing but a single tombstone. Those are going to be cheap to merge, so it seems unlikely that we'll end up with an unwieldy number of tombstone streams to interleave at search-time.

    The more likely problem is the one McCandless articulated regarding a large segment accumulating a lot of tombstone streams against it. But I agree with him that it only gets truly serious if your merge policy neglects such segments and allows them to deteriorate for too long.
    bq. For this scenario rolling bitsets may be better. Meaning pool bit sets and throw away unused readers.

    I don't think I understand. Is this the "combination index reader/writer" model, where the writer prepares a data structure that then gets handed off to the reader?
  • Paul Elschot (JIRA) at Jan 8, 2009 at 7:52 am
    [ https://issues.apache.org/jira/browse/LUCENE-1476?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12661880#action_12661880 ]

    Paul Elschot commented on LUCENE-1476:
    --------------------------------------

    Jason,

    bq. This issue LUCENE-584 had an implementation of a "matcher" class that the scorers implemented which I do not think made it into the committed patch.

    The only functional difference between the DocIdSetIterator as committed and the Matcher class at 584 is the Matcher.explain() method, which did not make it into DocIdSetIterator.
  • Michael McCandless (JIRA) at Jan 8, 2009 at 10:59 am
    [ https://issues.apache.org/jira/browse/LUCENE-1476?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12661934#action_12661934 ]

    Michael McCandless commented on LUCENE-1476:
    --------------------------------------------


    {quote}
    PostingList would be completely ignorant of deletions, as would classes like NOTScorer and MatchAllScorer:
    {quote}

    This is a neat idea! Deletions are then applied just like a Filter.

    For a TermQuery (one term) the cost of the two approaches should be
    the same.

    For OR'd Term queries, it actually seems like your proposed approach
    may be lower cost? Ie rather than each TermEnum doing the "AND NOT
    deleted" intersection, you only do it once at the top. There is added
    cost in that each TermEnum is now returning more docIDs than before,
    but the deleted ones are eliminated before scoring.

    For AND (and other) queries I'm not sure. In theory, having to
    process more docIDs is more costly, eg a PhraseQuery or SpanXXXQuery
    may see much higher net cost. We should test.

    Conceivably, a future "search optimization phase" could pick & choose
    the best point to inject the "AND NOT deleted" filter. In fact, it
    could also pick when to inject a Filter... a costly per-docID search
    with a very restrictive filter could be far more efficient if you
    applied the Filter earlier in the chain.

    I'm also curious what cost you see of doing the merge sort for every
    search; I think it could be uncomfortably high since it's so
    hard-for-cpu-to-predict-branch-intensive. We could take the first
    search that doesn't use skipTo and save the result of the merge sort,
    essentially doing an in-RAM-only "merge" of those deletes, and let
    subsequent searches use that single merged stream. (This is not MMAP
    friendly, though).
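
    A hedged sketch of that in-RAM caching idea: pay for the priority-queue merge once, then let subsequent (non-skipTo) searches reuse the single merged stream (names are illustrative):

    {code}
    import java.util.ArrayList;
    import java.util.Iterator;
    import java.util.List;

    // Illustrative only: cache the merge-sorted deletions the first time they are
    // fully iterated so later searches can scan one in-RAM array instead of
    // re-running the branch-heavy merge sort.
    public class CachedDeletions {
        private int[] merged;   // ascending deleted docIDs, built lazily

        public synchronized int[] merged(Iterator<Integer> collatedTombstones) {
            if (merged == null) {
                List<Integer> docs = new ArrayList<Integer>();
                while (collatedTombstones.hasNext()) {
                    docs.add(collatedTombstones.next());   // the costly collation
                }
                merged = new int[docs.size()];
                for (int i = 0; i < merged.length; i++) merged[i] = docs.get(i);
            }
            return merged;   // later searches iterate (or binary search) this array
        }
    }
    {code}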

    In my initial rough testing, I switched to an iterator API for
    SegmentTermEnum and found that if the percentage of deletes is < 10% the
    search was a bit faster using an iterator vs random access, but above
    that it was slower. This was with an already "merged" list of in-order
    docIDs.

    Switching to an iterator API for accessing field values for many docs
    (LUCENE-831 -- new FieldCache API, LUCENE-1231 -- column stride
    fields) shouldn't have this same problem since it's the "top level"
    that's accessing the values (ie, one iterator per field X query).


  • Paul Elschot (JIRA) at Jan 8, 2009 at 11:37 am
    [ https://issues.apache.org/jira/browse/LUCENE-1476?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12661944#action_12661944 ]

    Paul Elschot commented on LUCENE-1476:
    --------------------------------------

    bq. To minimize CPU cycles, it would theoretically make more sense to handle deletions much higher up, at the top level Scorer, Searcher, or even the HitCollector level.

    How about a SegmentSearcher?
  • Marvin Humphrey (JIRA) at Jan 8, 2009 at 2:33 pm
    [ https://issues.apache.org/jira/browse/LUCENE-1476?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12661977#action_12661977 ]

    Marvin Humphrey commented on LUCENE-1476:
    -----------------------------------------

    Paul Elschot:
    bq. How about a SegmentSearcher?

    I like the idea of a SegmentSearcher in general. A little while back, I wondered whether exposing SegmentReaders was really the best way to handle segment-centric search. Upon reflection, I think it is. Segments are a good unit. They're pure inverted indexes (notwithstanding doc stores and tombstones); the larger composite only masquerades as one.

    I noticed that in one version of the patch for segment-centric search (LUCENE-1483), each sorted search involved the creation of sub-searchers, which were then used to compile Scorers. It would make sense to cache those as individual SegmentSearcher objects, no?

    And then, to respond to the original suggestion, the SegmentSearcher level seems like a good place to handle application of a deletions quasi-filter. I think we could avoid having to deal with segment-start offsets that way.
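
    To make the SegmentSearcher idea concrete, here is a minimal, hypothetical sketch of a per-segment search driver that applies deletions as a quasi-filter at the segment level, using the Lucene 2.4-era Scorer/HitCollector API. The class name is made up, it is not part of any patch, and it assumes a Scorer chain that no longer filters deletions itself:

    {code}
    import java.io.IOException;
    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.search.HitCollector;
    import org.apache.lucene.search.Scorer;
    import org.apache.lucene.search.Weight;

    // Hypothetical SegmentSearcher sketch: deletions are filtered here, per
    // segment, instead of inside TermDocs. Everything is segment-local, so no
    // segment-start (docBase) offsets are needed.
    class SegmentSearcherSketch {
        private final IndexReader segmentReader; // a single SegmentReader

        SegmentSearcherSketch(IndexReader segmentReader) {
            this.segmentReader = segmentReader;
        }

        void search(Weight weight, HitCollector collector) throws IOException {
            Scorer scorer = weight.scorer(segmentReader);
            if (scorer == null) {
                return;
            }
            while (scorer.next()) {
                int doc = scorer.doc();
                if (segmentReader.isDeleted(doc)) {
                    continue; // the deletions quasi-filter lives at this level
                }
                collector.collect(doc, scorer.score());
            }
        }
    }
    {code}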
  • Marvin Humphrey (JIRA) at Jan 8, 2009 at 2:49 pm
    [ https://issues.apache.org/jira/browse/LUCENE-1476?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12661982#action_12661982 ]

    Marvin Humphrey commented on LUCENE-1476:
    -----------------------------------------

    How about if we model deletions-as-iterator on BitSet.nextSetBit(int tick) instead of a true iterator that keeps state?

    You can do that now by implementing BitVector.nextSetBit(int tick) and using that in TermDocs to set a nextDeletion member var instead of checking every doc num with BitVector.get().

    That way, the object that provides deletions can still be shared.
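
    For illustration, a nextSetBit(int) along those lines might look like the following. This is a rough sketch written against BitVector's existing byte[] bits / int size fields, not the code in the attached patch:

    {code}
    /**
     * Sketch only: returns the index of the first set bit at or after
     * index, or -1 if there is none. Assumes BitVector's existing byte[]
     * layout, where bit i lives in bits[i >> 3].
     */
    public int nextSetBit(int index) {
        if (index < 0 || index >= size) {
            return -1;
        }
        int firstByte = index >> 3;
        // Finish the partial first byte.
        for (int i = index; i < size && (i >> 3) == firstByte; i++) {
            if ((bits[firstByte] & (1 << (i & 7))) != 0) {
                return i;
            }
        }
        // Then skip whole zero bytes before scanning bit by bit.
        for (int b = firstByte + 1; b < bits.length; b++) {
            if (bits[b] == 0) {
                continue;
            }
            for (int i = b << 3; i < (b << 3) + 8 && i < size; i++) {
                if ((bits[b] & (1 << (i & 7))) != 0) {
                    return i;
                }
            }
        }
        return -1;
    }
    {code}

    SegmentTermDocs could then keep a nextDeletion member and call nextSetBit() only when the current doc reaches it, instead of calling get() on every doc.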
  • Marvin Humphrey at Jan 8, 2009 at 10:20 pm

    You can do that now by implementing BitVector.nextSetBit(int tick) and using
    that in TermDocs to set a nextDeletion member var instead of checking every
    doc num with BitVector.get().
    This seems so easy, I should take a crack at it. :)

    Marvin Humphrey


  • Marvin Humphrey (JIRA) at Jan 8, 2009 at 3:19 pm
    [ https://issues.apache.org/jira/browse/LUCENE-1476?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12661995#action_12661995 ]

    Marvin Humphrey commented on LUCENE-1476:
    -----------------------------------------

    Mike McCandless:
    {quote}
    For a TermQuery (one term) the cost of the two approaches should be
    the same.
    {quote}
    It'll be close, but I don't think that's quite true. TermScorer pre-fetches
    document numbers in batches from the TermDocs object. At present, only
    non-deleted doc nums get cached. If we move the deletions filtering up, then
    we'd increase traffic through that cache. However, filling it would be
    slightly cheaper, because we wouldn't be performing the deletions check.

    In theory. I'm not sure there's a way to streamline away that deletions check
    in TermDocs and maintain backwards compatibility. And while this is a fun
    brainstorm, I'm still far from convinced that having TermDocs.next() and
    Scorer.next() return deleted docs by default is a good idea.
    {quote}
    For AND (and other) queries I'm not sure. In theory, having to
    process more docIDs is more costly, eg a PhraseQuery or SpanXXXQuery
    may see much higher net cost.
    {quote}
    If you were applying deletions filtering after Scorer.next(), then it seems
    likely that costs would go up because of extra hit processing. However, if
    you use Scorer.skipTo() to jump past deletions, as in the loop I provided
    above, then PhraseScorer etc. shouldn't incur any more costs themselves.
    {quote}
    a costly per-docID search
    with a very restrictive filter could be far more efficient if you
    applied the Filter earlier in the chain.
    {quote}
    Under the skipTo() loop, I think the filter effectively *does* get applied
    earlier in the chain. Does that make sense?

    I think the potential performance downside comes down to prefetching in
    TermScorer, unless there are other classes that do similar prefetching.
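
    The skipTo() loop referred to above appears earlier in the thread and is not reproduced here. As a rough, hypothetical stand-in, filtering deletions above the Scorer with Lucene 2.4-era APIs could look like this (names are invented for illustration):

    {code}
    import java.io.IOException;
    import org.apache.lucene.search.HitCollector;
    import org.apache.lucene.search.Scorer;
    import org.apache.lucene.util.BitVector;

    // Illustrative only: collect hits while using skipTo() to jump past
    // deleted docs, so deletions act as a filter above the Scorer rather
    // than inside TermDocs.
    class DeletionsAsFilter {
        static void search(Scorer scorer, BitVector deletedDocs, HitCollector collector)
                throws IOException {
            boolean more = scorer.next();
            while (more) {
                int doc = scorer.doc();
                if (deletedDocs != null && deletedDocs.get(doc)) {
                    more = scorer.skipTo(doc + 1); // jump past the deleted doc (or run of docs)
                } else {
                    collector.collect(doc, scorer.score());
                    more = scorer.next();
                }
            }
        }
    }
    {code}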



  • Mark Miller (JIRA) at Jan 8, 2009 at 3:21 pm
    [ https://issues.apache.org/jira/browse/LUCENE-1476?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12661998#action_12661998 ]

    Mark Miller commented on LUCENE-1476:
    -------------------------------------

    bq. I noticed that in one version of the patch for segment-centric search (LUCENE-1483), each sorted search involved the creation of sub-searchers, which were then used to compile Scorers. It would make sense to cache those as individual SegmentSearcher objects, no?

    That's a fairly old version I think (based on using MultiSearcher as a hack). Now we are using one queue and running it through each subreader of the MultiReader.
  • Jason Rutherglen (JIRA) at Jan 8, 2009 at 4:47 pm
    [ https://issues.apache.org/jira/browse/LUCENE-1476?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12662033#action_12662033 ]

    Jason Rutherglen commented on LUCENE-1476:
    ------------------------------------------

    Marvin: "The whole tombstone idea arose out of the need for (close to) realtime search! It's intended to improve write speed."

    It does improve the write speed. While trying to make realtime search writes fast enough, I found that writing individual files per segment can become too costly: they accumulate fast, appending to a single file is faster than creating new files, and deleting the files becomes costly. For example, writing small individual files per commit, when the number of segments is large and a delete spans multiple segments, will generate many files. How much this matters depends on how often updates are expected to occur; I modeled the extreme case on the update frequency of a MySQL instance backing data for a web application.

    The MySQL design, translated to Lucene, is a transaction log per index, where the updates (documents and deletes) are written to the transaction log file. If Lucene crashed for some reason, the transaction log would be replayed. The in-memory indexes and newly deleted document bitvectors would be held in RAM (LUCENE-1314) until flushed, either manually or based on memory usage. Many users may not want a transaction log because they already store the updates in a separate SQL database instance (this is the case where I work), so a transaction log is redundant and should be optional. The first implementation of this will not have a transaction log.

    Marvin: "I don't think I understand. Is this the "combination index reader/writer" model, where the writer prepares a data structure that then gets handed off to the reader?"

    It would be exposed as a combination reader/writer that manages the transaction status of each update. The internal architecture is such that after each update, a new reader representing the new documents and deletes for the transaction is generated and put onto a stack. The reader stack is drained based on whether a reader is too old to be useful anymore (i.e. no references to it, or it has N readers ahead of it).
  • Marvin Humphrey (JIRA) at Jan 8, 2009 at 7:13 pm
    [ https://issues.apache.org/jira/browse/LUCENE-1476?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12662065#action_12662065 ]

    Marvin Humphrey commented on LUCENE-1476:
    -----------------------------------------

    Jason Rutherglen:
    {quote}
    I found in making the realtime search write speed fast enough that writing
    to individual files per segment can become too costly (they accumulate fast,
    appending to a single file is faster than creating new files, deleting the
    files becomes costly).
    {quote}
    I saw you mentioning i/o overhead on Windows in particular. I can't see a way
    to mod Lucene so that it doesn't generate a bunch of files for each commit,
    and FWIW Lucy/KS is going to generate even more files than Lucene.

    Half-seriously... how about writing a single-file Directory implementation?
    {quote}
    For example, writing to small individual files per commit, if the number of
    segments is large and the delete spans multiple segments will generate many
    files.
    {quote}
    There would be a maximum of two files per segment to hold the tombstones: one
    to hold the tombstone rows, and one to map segment identifiers to tombstone
    rows. (In Lucy/KS, the mappings would probably be stored in the JSON-encoded
    "segmeta" file, which stores human-readable metadata on behalf of multiple
    components.)

    Segments containing tombstones would be merged according to whatever merge
    policy was in place. So there won't ever be an obscene number of tombstone
    files unless you allow an obscene number of segments to accumulate.
    {quote}
    Many users may not want a transaction log as they may be storing the updates
    in a separate SQL database instance (this is the case where I work) and so a
    transaction log is redundant and should be optional.
    {quote}
    I can see how this would be quite useful at the application level. However, I
    think it might be challenging to generalize the transaction log concept at the
    library level:

    {code}
    // Illustration only: CustomAnalyzer, setFoo() and nextDoc() are hypothetical.
    CustomAnalyzer analyzer = new CustomAnalyzer();
    IndexWriter indexWriter = new IndexWriter("/path/to/index", analyzer, true,
        IndexWriter.MaxFieldLength.UNLIMITED);
    indexWriter.addDocument(nextDoc());
    analyzer.setFoo(2); // change of state not recorded by transaction log
    indexWriter.addDocument(nextDoc());
    {code}

    MySQL is more of a closed system than Lucene, which I think makes options
    available that aren't available to us.
    {quote}
    The reader stack is drained based on whether a reader is too old to be
    useful anymore (i.e. no references to it, or it has N readers ahead of it).
    {quote}
    Right, this is the kind of thing that Lucene has to do because of the
    single-reader model, and that we're trying to get away from in Lucy/KS by
    exploiting mmap and making IndexReaders cheap wrappers around the system i/o
    cache.

    I don't think I can offer any alternative design suggestions that meet your
    needs. There's going to be a change rate that overwhelms the multi-file
    commit system, and it seems that you've determined you're up against it.

    What's killing us is something different: not absolute change rate, but poor
    worst-case performance.

    FWIW, we contemplated a multi-index system with an index on a RAM disk for
    fast changes and a primary index on the main file system. It would have
    worked fine for pure adds, but it was very tricky to manage state for
    documents which were being "updated", i.e. deleted and re-added. How are you
    handling all these small adds with your combo reader/writer? Do you not have
    that problem?
  • Michael McCandless (JIRA) at Jan 8, 2009 at 7:45 pm
    [ https://issues.apache.org/jira/browse/LUCENE-1476?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12662089#action_12662089 ]

    Michael McCandless commented on LUCENE-1476:
    --------------------------------------------

    {quote}
    There's going to be a change rate that overwhelms the multi-file
    commit system, and it seems that you've determined you're up against
    it.
    {quote}

    Well... IndexWriter need not "commit" in order to allow a reader to
    see the files?

    Commit is for crash recovery, and for knowing when it's OK to delete
    prior commits. Simply writing the files (and not syncing them), and
    perhaps giving IndexReader.open the SegmentInfos to use directly (and
    not writing a segments_N via the filesystem) would allow us to search
    added docs without paying the cost of sync'ing all the files.

    Also: brand new, tiny segments should be written into a RAMDirectory
    and then merged over time into the real Directory.
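
    As a rough illustration of that idea with today's public API (not the proposed internal change, which would happen inside a single IndexWriter), small segments can be buffered in a RAMDirectory and folded into the main Directory later; the paths and analyzer choice below are placeholders:

    {code}
    import java.io.IOException;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.store.Directory;
    import org.apache.lucene.store.FSDirectory;
    import org.apache.lucene.store.RAMDirectory;

    public class RamBufferSketch {
        public static void main(String[] args) throws IOException {
            Directory mainDir = FSDirectory.getDirectory("/path/to/index"); // assumes an existing index
            RAMDirectory ramDir = new RAMDirectory();

            // Tiny, short-lived segments go to the RAMDirectory first.
            IndexWriter ramWriter = new IndexWriter(ramDir, new StandardAnalyzer(),
                true, IndexWriter.MaxFieldLength.UNLIMITED);
            // ... addDocument() small batches here; readers reopen against ramDir ...
            ramWriter.close();

            // Later, merge the buffered segments into the real Directory.
            IndexWriter mainWriter = new IndexWriter(mainDir, new StandardAnalyzer(),
                false, IndexWriter.MaxFieldLength.UNLIMITED);
            mainWriter.addIndexesNoOptimize(new Directory[] { ramDir });
            mainWriter.close();
        }
    }
    {code}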

  • Michael McCandless (JIRA) at Jan 8, 2009 at 7:49 pm
    [ https://issues.apache.org/jira/browse/LUCENE-1476?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12662092#action_12662092 ]

    Michael McCandless commented on LUCENE-1476:
    --------------------------------------------

    {quote}
    If Lucene crashed for some reason the transaction log would be replayed.
    {quote}

    I think the transaction log is useful for some applications, but could
    (should) be built as a separate (optional) layer entirely on top of
    Lucene's core. Ie, neither IndexWriter nor IndexReader need to be
    aware of the transaction log, which update belongs to which
    transaction, etc?

  • Michael McCandless (JIRA) at Jan 8, 2009 at 8:01 pm
    [ https://issues.apache.org/jira/browse/LUCENE-1476?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12662097#action_12662097 ]

    Michael McCandless commented on LUCENE-1476:
    --------------------------------------------


    {quote}
    It would be exposed as a combination reader writer that manages the transaction status of each update.
    {quote}

    I think the transactions layer would also sit on top of this
    "realtime" layer? EG this "realtime" layer would expose a commit()
    method, and the transaction layer above it would maintain the
    transaction log, periodically calling commit() and truncating the
    transaction log?

    This "realtime" layer, then, would internally maintain a single
    IndexWriter and the readers. IndexWriter would flush (not commit) new
    segments into a RAMDir and yield its in-RAM SegmentInfos to
    IndexReader.reopen. MergePolicy periodically gets those into the real
    Directory. When reopening a reader we have the freedom to use old
    (already merged away) segments if the newly merged segment isn't warm
    yet.

    We "just" need to open some things up in IndexWriter:

    * IndexReader.reopen with the in-RAM SegmentInfos

    * Willingness to allow an IndexReader to maintain & update deleted
    docs even though IndexWriter has the write lock

    * Access to segments that were already merged away (I think we could
    make a DeletionPolicy that pays attention to when the newly merged
    segment is not yet warmed and keeps the prior segments around).
    I think this'd require allowing DeletionPolicy to see "flush
    points" in addition to commit points (it doesn't today).

    But I'm still hazy on the details on exactly how to open up
    IndexWriter.

  • Marvin Humphrey (JIRA) at Jan 8, 2009 at 8:07 pm
    [ https://issues.apache.org/jira/browse/LUCENE-1476?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12662100#action_12662100 ]

    Marvin Humphrey commented on LUCENE-1476:
    -----------------------------------------

    Mike McCandless:
    {quote}
    I'm also curious what cost you see of doing the merge sort for every
    search; I think it could be uncomfortably high since it's so
    hard-for-cpu-to-predict-branch-intensive.
    {quote}
    Probably true. You're going to get accelerating degradation as the number of
    deletions increases. In a large index, you could end up merging 20, 30
    streams. Based on how the priority queue in ORScorer tends to take up space
    in profiling data, that might not be good.

    It'd be manageable if you can keep your index in reasonably good shape, but you'll
    be suckin' pondwater if it gets flabby.
    {quote}
    We could take the first search that doesn't use skipTo and save the result
    of the merge sort, essentially doing an in-RAM-only "merge" of those
    deletes, and let subsequent searches use that single merged stream.
    {quote}
    That was what I had in mind when proposing the pseudo-iterator model.

    {code}
    class TombStoneDelEnum extends DelEnum {
        int nextDeletion(int docNum) {
            while (currentMax < docNum) { nextInternal(); }
            return bits.nextSetBit(docNum);
        }
        // ...
    }
    {code}
    bq. (This is not MMAP friendly, though).
    Yeah. Ironically, that use of tombstones is more compatible with the Lucene
    model. :-)

    I'd be reluctant to have Lucy/KS realize those large BitVectors in per-object
    process RAM. That'd spoil the "cheap wrapper around system i/o cache"
    IndexReader plan.

    I can't see an answer yet. But the one thing I do know is that Lucy/KS needs
    a pluggable deletions mechanism to make experimentation easier -- so that's
    what I'm working on today.
  • Michael McCandless (JIRA) at Jan 8, 2009 at 8:11 pm
    [ https://issues.apache.org/jira/browse/LUCENE-1476?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12662101#action_12662101 ]

    Michael McCandless commented on LUCENE-1476:
    --------------------------------------------

    {quote}
    If we move the deletions filtering up, then we'd increase traffic through that cache
    {quote}

    OK, right. So we may have some added cost because of this. I think
    it's only TermScorer that uses the bulk API though.

    {quote}
    If you were applying deletions filtering after Scorer.next(), then it seems
    likely that costs would go up because of extra hit processing. However, if
    you use Scorer.skipTo() to jump past deletions, as in the loop I provided
    above, then PhraseScorer etc. shouldn't incur any more costs themselves.
    {quote}

    Ahhh, now I got it! Good, you're right.

    {quote}
    Under the skipTo() loop, I think the filter effectively does get applied
    earlier in the chain. Does that make sense?
    {quote}

    Right. This is how Lucene works today. Excellent.

    So, net/net it seems like the "deletes-as-a-filter" approach is compelling?

  • Michael McCandless (JIRA) at Jan 8, 2009 at 8:13 pm
    [ https://issues.apache.org/jira/browse/LUCENE-1476?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12662102#action_12662102 ]

    Michael McCandless commented on LUCENE-1476:
    --------------------------------------------

    {quote}
    How about if we model deletions-as-iterator on BitSet.nextSetBit(int tick) instead of a true iterator that keeps state?
    {quote}
    That works if under-the-hood it's a non-sparse representation. But if it's sparse, you need an iterator (state) to remember where you are.
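
    To spell out the distinction, here is a hypothetical sketch of an iterator over a sparse deletions representation; the point is that it has to keep a position (pos) to stay cheap per call, which is exactly the state a shared, stateless nextSetBit()-style object avoids:

    {code}
    // Hypothetical sketch: a stateful iterator over a sparse deletions
    // representation (a sorted array of deleted docIDs).
    class SparseDeletionsIterator {
        private final int[] sortedDeletedDocs; // sparse representation, ascending order
        private int pos = 0;                   // iterator state: where we left off

        SparseDeletionsIterator(int[] sortedDeletedDocs) {
            this.sortedDeletedDocs = sortedDeletedDocs;
        }

        /** Returns the first deleted docID >= docNum, or Integer.MAX_VALUE if none. */
        int nextDeletion(int docNum) {
            while (pos < sortedDeletedDocs.length && sortedDeletedDocs[pos] < docNum) {
                pos++;
            }
            return pos < sortedDeletedDocs.length ? sortedDeletedDocs[pos] : Integer.MAX_VALUE;
        }
    }
    {code}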
  • Marvin Humphrey (JIRA) at Jan 8, 2009 at 8:29 pm
    [ https://issues.apache.org/jira/browse/LUCENE-1476?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12662107#action_12662107 ]

    Marvin Humphrey commented on LUCENE-1476:
    -----------------------------------------

    Mike McCandless:
    {quote}
    Commit is for crash recovery, and for knowing when it's OK to delete
    prior commits. Simply writing the files (and not syncing them), and
    perhaps giving IndexReader.open the SegmentInfos to use directly (and
    not writing a segments_N via the filesystem) would allow us to search
    added docs without paying the cost of sync'ing all the files.
    {quote}
    Mmm. I think I might have given IndexWriter.commit() slightly different
    semantics. Specifically, I might have given it a boolean "sync" argument
    which defaults to false.
    {quote}
    Also: brand new, tiny segments should be written into a RAMDirectory
    and then merged over time into the real Directory.
    {quote}
    Two comments. First, if you don't sync, but rather leave it up to the OS when
    to actually perform the disk i/o, how expensive is flushing? Can
    we make it cheap enough to meet Jason's absolute change rate requirements?

    Second, the multi-index model is very tricky when dealing with "updates". How
    do you guarantee that you always see the "current" version of a given
    document, and only that version? When do you expose new deletes in the
    RAMDirectory, when do you expose new deletes in the FSDirectory, how do you
    manage slow merges from the RAMDirectory to the FSDirectory, how do you manage
    new adds to the RAMDirectory that take place during slow merges...

    Building a single-index, two-writer model that could handle fast updates while
    performing background merging was one of the main drivers behind the tombstone
    design.

  • Robert engels at Jan 8, 2009 at 10:57 pm
    The way we've simplified this is that every document has an OID. It
    simplifies updates and delete tracking (in the transaction log).
  • Marvin Humphrey (JIRA) at Jan 8, 2009 at 8:35 pm
    [ https://issues.apache.org/jira/browse/LUCENE-1476?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12662110#action_12662110 ]

    Marvin Humphrey commented on LUCENE-1476:
    -----------------------------------------

    Mike McCandless:
    bq. if it's sparse, you need an iterator (state) to remember where you are.
    We can hide the sparse representation and the internal state, having the
    object lazily build a non-sparse representation. That's what I had in
    mind with the code for TombstoneDelEnum.nextDeletion().
    TombstoneDelEnum.nextInternal() would be a private method used for building up
    the internal BitVector.
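
    Filling that idea in a bit, a hypothetical Java sketch might look like the following. DelEnum, the stream-merging details, and every name here are assumptions (and the real code would target Lucy/KS rather than Java):

    {code}
    import java.util.BitSet;

    // The sparse tombstone streams and their merge state stay private; callers
    // only ever see a BitSet.nextSetBit()-style call backed by a lazily filled,
    // non-sparse bit set.
    abstract class TombStoneDelEnumSketch {
        private final BitSet bits = new BitSet(); // non-sparse cache, filled lazily
        private int currentMax = -1;              // highest docNum merged into bits so far

        /** First deleted doc >= docNum, or -1 if none. Looks stateless to callers. */
        int nextDeletion(int docNum) {
            while (currentMax < docNum && !exhausted()) {
                nextInternal(); // pull one more deletion out of the merged tombstone streams
            }
            return bits.nextSetBit(docNum);
        }

        /** Merge the next docNum from the underlying streams into the bit set. */
        private void nextInternal() {
            int deletedDoc = readNextFromMergedStreams();
            bits.set(deletedDoc);
            currentMax = deletedDoc;
        }

        abstract boolean exhausted();
        abstract int readNextFromMergedStreams();
    }
    {code}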
  • Marvin Humphrey (JIRA) at Jan 8, 2009 at 10:13 pm
    [ https://issues.apache.org/jira/browse/LUCENE-1476?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12662143#action_12662143 ]

    Marvin Humphrey commented on LUCENE-1476:
    -----------------------------------------

    Mike McCandless:
    bq. So, net/net it seems like the "deletes-as-a-filter" approach is compelling?
    In terms of CPU-cycles, maybe.

    My gut tells me that it's all but mandatory if we use merged-on-the-fly
    tombstone streams, but if Lucene goes that route it should cache a BitVector
    and use a shared pseudo-iterator -- in which case the costs will no longer be
    significantly more than the current system.

    Under the current system, I'm not certain that the deletions checks are that
    excessive. Consider this loop from TermDocs.read():

    {code}
    while (i < length && count < df) {
        // manually inlined call to next() for speed
        final int docCode = freqStream.readVInt();
        doc += docCode >>> 1;             // shift off low bit
        if ((docCode & 1) != 0)           // if low bit is set
            freq = 1;                     // freq is one
        else
            freq = freqStream.readVInt(); // else read freq
        count++;

        if (deletedDocs == null || !deletedDocs.get(doc)) {
            docs[i] = doc;
            freqs[i] = freq;
            ++i;
        }
    }
    {code}

    The CPU probably does a good job of predicting the result of the null check on
    deletedDocs. The readVInt() method call is already a pipeline killer.

    Here's how that loop looks after I patch the deletions check for pseudo-iteration.

    {code}
    while (i < length && count < df) {
        // manually inlined call to next() for speed
        final int docCode = freqStream.readVInt();
        doc += docCode >>> 1;             // shift off low bit
        if ((docCode & 1) != 0)           // if low bit is set
            freq = 1;                     // freq is one
        else
            freq = freqStream.readVInt(); // else read freq
        count++;

        if (doc >= nextDeletion) {
            if (doc > nextDeletion) {
                nextDeletion = deletedDocs.nextSetBit(doc);
            }
            if (doc == nextDeletion) {
                continue;
            }
        }

        docs[i] = doc;
        freqs[i] = freq;
        ++i;
    }
    return i;
    {code}

    Again, the CPU is probably going to do a pretty good job of predicting the
    results of the deletion check. And even then, we're accessing the same shared
    BitVector across all TermDocs, and its bits are hopefully a cache hit.

    To really tighten this loop, you have to do what Nate and I want with Lucy/KS:

    * Remove all function/method call overhead.
    * Operate directly on the memory mapped postings file.

    {code}
    u32_t
    SegPList_bulk_read(SegPostingList *self, i32_t *doc_nums, i32_t *freqs,
                       u32_t request)
    {
        i32_t       doc_num   = self->doc_num;
        const u32_t remaining = self->doc_freq - self->count;
        const u32_t num_got   = request < remaining ? request : remaining;
        char       *buf       = InStream_Buf(instream, C32_MAX_BYTES * num_got);
        u32_t i;

        for (i = 0; i < num_got; i++) {
            u32_t doc_code = Math_decode_c32(&buf); /* static inline function */
            u32_t freq     = (doc_code & 1) ? 1 : Math_decode_c32(&buf);
            doc_num       += doc_code >> 1;
            doc_nums[i]    = doc_num;
            freqs[i]       = freq;
        }

        InStream_Advance_Buf(instream, buf);
        self->doc_num = doc_num;
        self->count  += num_got;

        return num_got;
    }
    {code}

    (That loop would be even better using PFOR instead of vbyte.)

    In terms of public API, I don't think it's reasonable to change Lucene's
    Scorer and TermDocs classes so that their iterators start returning deleted
    docs.

    We could potentially make that choice with Lucy/KS, thus allowing us to remove
    the deletions check in the PostingList iterator (as above) and getting a
    potential speedup. But even then I hesitate to push the deletions API upwards
    into a space where users of raw Scorer and TermDocs classes have to deal with
    it -- especially since iterator-style deletions aren't very user-friendly.

  • Jason Rutherglen (JIRA) at Jan 9, 2009 at 1:41 am
    [ https://issues.apache.org/jira/browse/LUCENE-1476?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12662214#action_12662214 ]

    Jason Rutherglen commented on LUCENE-1476:
    ------------------------------------------

    M.M.: "I think the transactions layer would also sit on top of this
    "realtime" layer? EG this "realtime" layer would expose a commit()
    method, and the transaction layer above it would maintain the
    transaction log, periodically calling commit() and truncating the
    transaction log?"

    One approach that may be optimal is to expose from IndexWriter a createTransaction method that accepts new documents and deletes. All documents have an associated UID. The new documents could feasibly be encoded into a single segment that represents the added documents for that transaction. The deletes would be represented as long document UIDs rather than int doc ids. Then the commit method would be called on the transaction object, which returns a reader representing the latest version of the index plus the changes created by the transaction. This system would be part of IndexWriter and would not rely on a transaction log. IndexWriter.commit would flush the in-RAM realtime indexes to disk. The realtime merge policy would flush based on RAM usage or the number of docs.

    {code}
    // Proposed API sketch; createTransaction, Transaction and getReader do not exist yet.
    IndexWriter iw = new IndexWriter();
    Transaction tr = iw.createTransaction();
    tr.addDocument(new Document());
    tr.addDocument(new Document());
    tr.deleteDocument(1200L);
    IndexReader ir = tr.flush();                // flushes the transaction to the index (probably to a RAM index)
    IndexReader latestReader = iw.getReader();  // same as ir
    iw.commit(doWait);                          // boolean doWait; commits the in-RAM realtime index to disk
    {code}

    When commit is called, the disk segment readers flush their deletes to disk, which is fast. The in-RAM realtime index is merged to disk. The process is described in more detail further down.

    M.H.: "how about writing a single-file Directory implementation?"

    I'm not sure we need this, because an appending rolling transaction log should work. Segments don't change; only things like norms and deletes do, and those can be appended to a rolling transaction log file system. If we had a generic transaction logging system, the future column stride fields, deletes, norms, and other future realtime features could use it and be realtime.

    M.H.: "How do you guarantee that you always see the "current" version of a given document, and only that version?"

    Each transaction returns an IndexReader. Each "row" or "object" could use a unique id in the transaction log model which would allow documents that were merged into other segments to be deleted during a transaction log replay.

    M.H.: "When do you expose new deletes in the RAMDirectory, when do you expose new deletes in the FSDirectory, how do you manage slow merges from the RAMDirectory to the FSDirectory, how do you manage new adds to the RAMDirectory that take place during slow merges..."

    Queue deletes to the RAMDir while copying the RAMDir to the FSDir in the background, perform the deletes after the copy is completed, then instantiate a new reader with the newly merged FSDirectory and a new RAMDir. Writes that were occurring during this process would be happening to another new RAMDir.
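
    A sketch of that sequence follows; every name here (RealtimeFlushSketch, copyRamSegmentsToDisk, applyDeletes) is invented for illustration and none of this is an existing Lucene API:

    {code}
    import java.io.IOException;
    import java.util.ArrayList;
    import java.util.List;
    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.store.Directory;
    import org.apache.lucene.store.RAMDirectory;

    // Hypothetical flush sequence: queue deletes while the RAMDir is copied to
    // the FSDir in the background, apply them once the copy finishes, then open
    // a new reader; new writes go to a fresh RAMDir in the meantime.
    abstract class RealtimeFlushSketch {
        protected Directory fsDir;
        protected RAMDirectory ramDir = new RAMDirectory();
        protected final List<Long> queuedDeletes = new ArrayList<Long>(); // UIDs deleted during the copy

        IndexReader flushToDisk() throws IOException {
            RAMDirectory oldRam = ramDir;
            ramDir = new RAMDirectory();          // new writes land in a fresh RAMDir
            copyRamSegmentsToDisk(oldRam, fsDir); // the slow part, runs in the background
            applyDeletes(fsDir, queuedDeletes);   // deletes queued while copying
            queuedDeletes.clear();
            return IndexReader.open(fsDir);       // reader over the newly merged index
        }

        protected abstract void copyRamSegmentsToDisk(RAMDirectory src, Directory dest)
            throws IOException;
        protected abstract void applyDeletes(Directory dir, List<Long> deletedUids)
            throws IOException;
    }
    {code}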

    One way to think of the realtime problem is in terms of segments rather than FSDirs and RAMDirs. Some segments are on disk, some in RAM. Each transaction is an instance of some segments and their deletes (and we're not worried about the deletes being flushed or not so assume they exist as BitVectors). The system should expose an API to checkpoint/flush at a given transaction level (usually the current) and should not stop new updates from happening.

    When I wrote this type of system, I managed individual segments outside of IndexWriter's merge policy and performed the merging manually by placing each segment in its own FSDirectory (the segment size was 64MB), which minimized the number of directories. I do not know the best approach for this when performed within IndexWriter.

    M.H.: "Two comments. First, if you don't sync, but rather leave it up to the OS when
    it wants to actually perform the actual disk i/o, how expensive is flushing? Can
    we make it cheap enough to meet Jason's absolute change rate requirements?"

    When I tried out the transaction log, a write usually mapped pretty quickly to a hard disk write. I don't think it's safe to leave writes up to the OS.

    M.M.: "maintain & update deleted docs even though IndexWriter has the write lock"

    In my previous realtime search implementation I got around this by having each segment in its own directory. Assuming this is non-optimal, we will need to expose an IndexReader that has the write lock of the IndexWriter.

  • Marvin Humphrey (JIRA) at Jan 9, 2009 at 3:37 am
    [ https://issues.apache.org/jira/browse/LUCENE-1476?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

    Marvin Humphrey updated LUCENE-1476:
    ------------------------------------

    Attachment: quasi_iterator_deletions.diff

    Here's a patch implementing BitVector.nextSetBit() and converting
    SegmentTermDocs over to use the quasi-iterator style. Tested but
    not benchmarked.
  • Doug Cutting (JIRA) at Jan 9, 2009 at 5:33 am
    [ https://issues.apache.org/jira/browse/LUCENE-1476?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12662247#action_12662247 ]

    Doug Cutting commented on LUCENE-1476:
    --------------------------------------

    bq. To really tighten this loop, you have to [ ... ] remove all function/method call overhead [and] operate directly on the memory mapped postings file.

    That sounds familiar...

    http://svn.apache.org/viewvc/lucene/java/trunk/src/gcj/org/apache/lucene/index/GCJTermDocs.cc?view=markup

  • Michael McCandless (JIRA) at Jan 9, 2009 at 12:00 pm
    [ https://issues.apache.org/jira/browse/LUCENE-1476?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12662339#action_12662339 ]

    Michael McCandless commented on LUCENE-1476:
    --------------------------------------------

    {quote}
    Mmm. I think I might have given IndexWriter.commit() slightly different
    semantics. Specifically, I might have given it a boolean "sync" argument
    which defaults to false.
    {quote}

    It seems like there are various levels of increasing "durability" here:

    * Make available to a reader in the same JVM (eg flush new segment
    to a RAMDir) -- not exposed today.

    * Make available to a reader sharing the filesystem right now (flush
    to Directory in "real" filesystem, but don't sync) -- exposed
    today (but deprecated as a public API) as flush.

    * Make available to readers even if OS/machine crashes (flush to
    Directory in "real" filesystem, and sync) -- exposed today as commit.

    {quote}
    Two comments. First, if you don't sync, but rather leave it up to the OS when
    it wants to actually perform the actual disk i/o, how expensive is flushing? Can
    we make it cheap enough to meet Jason's absolute change rate requirements?
    {quote}

    Right I've been wondering the same thing. I think this should be our
    first approach to realtime indexing, and then we swap in RAMDir if
    performance is not good enough.

    {quote}
    Second, the multi-index model is very tricky when dealing with "updates". How
    do you guarantee that you always see the "current" version of a given
    document, and only that version? When do you expose new deletes in the
    RAMDirectory, when do you expose new deletes in the FSDirectory, how do you
    manage slow merges from the RAMDirectory to the FSDirectory, how do you manage
    new adds to the RAMDirectory that take place during slow merges...

    Building a single-index, two-writer model that could handle fast updates while
    performing background merging was one of the main drivers behind the tombstone
    design.
    {quote}

    I'm not proposing a multi-index model (at least I think I'm not!). A
    single IW could flush new tiny segments into a RAMDir and later merge
    them into a real Dir. But I agree: let's start w/ a single Dir and
    move to RAMDir only if necessary.

    {quote}
    Building a single-index, two-writer model that could handle fast updates while
    performing background merging was one of the main drivers behind the tombstone
    design.
    {quote}

    I think carrying the deletions in RAM (reopening the reader) is
    probably fastest for Lucene. Lucene with the "reopened stream of
    readers" approach can do this, but Lucy/KS (with mmap) must use
    the filesystem as the intermediary.

