DeletedDocs implement DocIdSet
------------------------------

Key: LUCENE-1476
URL: https://issues.apache.org/jira/browse/LUCENE-1476
Project: Lucene - Java
Issue Type: Improvement
Components: Index
Affects Versions: 2.4
Reporter: Jason Rutherglen
Priority: Trivial


DeletedDocs can implement DocIdSet. Then it can be exposed and replaced easily.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org


  • Jason Rutherglen (JIRA) at Dec 3, 2008 at 10:30 pm
    [ https://issues.apache.org/jira/browse/LUCENE-1476?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

    Jason Rutherglen updated LUCENE-1476:
    -------------------------------------

    Description: SegmentReader.DeletedDocs can implement DocIdSet. Then it can be exposed and replaced easily. (was: DeletedDocs can implement DocIdSet. Then it can be exposed and replaced easily.)
    Summary: SegmentReader.DeletedDocs implement DocIdSet (was: DeletedDocs implement DocIdSet)
  • Jason Rutherglen (JIRA) at Dec 3, 2008 at 10:44 pm
    [ https://issues.apache.org/jira/browse/LUCENE-1476?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

    Jason Rutherglen updated LUCENE-1476:
    -------------------------------------

    Description: BitVector can implement DocIdSet. This is for making SegmentReader.deletedDocs pluggable. (was: SegmentReader.DeletedDocs can implement DocIdSet. Then it can be exposed and replaced easily.)
    Remaining Estimate: 12h
    Original Estimate: 12h
    Summary: BitVector implement DocIdSet (was: SegmentReader.DeletedDocs implement DocIdSet)
  • Jason Rutherglen (JIRA) at Dec 4, 2008 at 12:02 am
    [ https://issues.apache.org/jira/browse/LUCENE-1476?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

    Jason Rutherglen updated LUCENE-1476:
    -------------------------------------

    Attachment: LUCENE-1476.patch

    LUCENE-1476.patch

    BitVector extends DocIdSet.

    TestBitVector implements a testDocIdSet method that is based on the TestSortedVIntList tests.
  • Michael McCandless (JIRA) at Dec 4, 2008 at 12:02 am
    [ https://issues.apache.org/jira/browse/LUCENE-1476?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12653070#action_12653070 ]

    Michael McCandless commented on LUCENE-1476:
    --------------------------------------------

    But, SegmentReader needs random access to the bits (DocIdSet only provides an iterator)?
  • Jason Rutherglen (JIRA) at Dec 4, 2008 at 12:10 am
    [ https://issues.apache.org/jira/browse/LUCENE-1476?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12653075#action_12653075 ]

    Jason Rutherglen commented on LUCENE-1476:
    ------------------------------------------

    Looks like we need a new abstract class. RABitSet?
  • robert engels (JIRA) at Dec 4, 2008 at 12:12 am
    [ https://issues.apache.org/jira/browse/LUCENE-1476?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12653076#action_12653076 ]

    robert engels commented on LUCENE-1476:
    ---------------------------------------

    BitSet is already random access, DocIdSet is not.
  • Jason Rutherglen (JIRA) at Dec 4, 2008 at 12:18 am
    [ https://issues.apache.org/jira/browse/LUCENE-1476?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12653080#action_12653080 ]

    Jason Rutherglen commented on LUCENE-1476:
    ------------------------------------------

    BitVector does not implement the methods of java.util.BitSet. RABitSet could be implemented by OpenBitSet and BitVector. This way an OpenBitSet or another filter such as P4Delta could be used in place of BitVector in SegmentReader.
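
A minimal Java sketch of the random-access abstraction floated above. "RABitSet" is only a name proposed in this thread, not an existing Lucene class. The thread suggests an abstract class; an interface is used here for brevity. `JavaBitSetAdapter` and `PluggableDeletes` are hypothetical names, and `java.util.BitSet` stands in for BitVector or OpenBitSet.

```java
import java.util.BitSet;

// Hypothetical abstraction: random access plus forward iteration over set bits.
interface RABitSet {
    boolean get(int docId);      // random access: is this doc deleted?
    int nextSetBit(int from);    // first set bit >= from, or -1 if none
    int length();                // number of addressable bits
}

// Adapter over java.util.BitSet, standing in for BitVector/OpenBitSet.
final class JavaBitSetAdapter implements RABitSet {
    private final BitSet bits;
    private final int size;

    JavaBitSetAdapter(BitSet bits, int size) {
        this.bits = bits;
        this.size = size;
    }

    public boolean get(int docId) { return bits.get(docId); }
    public int nextSetBit(int from) { return bits.nextSetBit(from); }
    public int length() { return size; }
}

// Hypothetical stand-in for the part of a segment reader that consults deletions.
final class PluggableDeletes {
    // volatile so a replacement set becomes visible to readers without locking
    private volatile RABitSet deletedDocs;

    void setDeletedDocs(RABitSet deletedDocs) { this.deletedDocs = deletedDocs; }

    boolean isDeleted(int docId) {
        RABitSet current = deletedDocs;
        return current != null && current.get(docId);
    }
}
```

Any structure that can answer `get` and `nextSetBit` could then be swapped in through `setDeletedDocs` without the caller knowing the concrete type.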

    The IndexReader.flush-type methods would need to either skip saving automatically or throw an exception, and there needs to be a setting for this. This helps with the synchronization issue in SegmentReader.isDeleted by allowing access to it.
  • Michael McCandless (JIRA) at Dec 5, 2008 at 11:32 am
    [ https://issues.apache.org/jira/browse/LUCENE-1476?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12653752#action_12653752 ]

    Michael McCandless commented on LUCENE-1476:
    --------------------------------------------

    bq. But, SegmentReader needs random access to the bits (DocIdSet only provides an iterator)?

    Although IndexReader.isDeleted exposes a random-access API to deleted docs, I think it may be overkill.

    Ie, in most (all?) uses of deleted docs throughout Lucene core/contrib, a simple iterator (DocIdSet) would in fact suffice.

    EG in SegmentTermDocs iteration we are always checking deletedDocs by ascending docID. It might be a performance gain (pure speculation) if we used an iterator API, because we could hold "nextDelDocID" and only advance that (skipTo) when the term's docID has moved past it. It's just like an "AND NOT X" clause.
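
The "AND NOT X" idea above can be sketched in Java roughly as follows. This is an illustration under stated assumptions, not SegmentTermDocs code: `DeletedDocsIterator`, `LiveDocsFilter` and `liveDocs` are hypothetical names, postings and deletions are plain sorted `int[]` arrays, and doc IDs are assumed to be smaller than `Integer.MAX_VALUE`.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical forward-only iterator over deleted docIDs (sorted ascending).
final class DeletedDocsIterator {
    static final int NO_MORE_DOCS = Integer.MAX_VALUE; // sentinel: no deletions left

    private final int[] deleted;
    private int pos = 0;

    DeletedDocsIterator(int[] sortedDeleted) {
        this.deleted = sortedDeleted;
    }

    // Advance to the first deleted docID >= target; return it, or NO_MORE_DOCS.
    int skipTo(int target) {
        while (pos < deleted.length && deleted[pos] < target) {
            pos++;
        }
        return pos < deleted.length ? deleted[pos] : NO_MORE_DOCS;
    }
}

final class LiveDocsFilter {
    // Returns the docIDs from an ascending postings list that are not deleted.
    static List<Integer> liveDocs(int[] postings, int[] sortedDeleted) {
        DeletedDocsIterator dels = new DeletedDocsIterator(sortedDeleted);
        int nextDelDocID = dels.skipTo(0);
        List<Integer> live = new ArrayList<>();
        for (int doc : postings) {
            // Only touch the deletions iterator once the term's docID has moved past it.
            if (doc > nextDelDocID) {
                nextDelDocID = dels.skipTo(doc);
            }
            if (doc != nextDelDocID) {
                live.add(doc);
            }
        }
        return live;
    }
}
```

With few deletions, most postings cost a single integer comparison against `nextDelDocID` instead of a bit lookup per document.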

    Similarly, norms, which also now expose a random-access API, should be fine with an iterator type API as well.

    This may also imply better VM behavior, since we don't actually require norms/deletions to be fully memory resident.

    This would be a biggish change, and it's not clear whether/when we should explore it, but I wanted to get the idea out there.

    Marvin, in KS/Lucy are you using random-access or iterator to access deletedDocs & norms?
  • robert engels (JIRA) at Dec 5, 2008 at 2:22 pm
    [ https://issues.apache.org/jira/browse/LUCENE-1476?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12653793#action_12653793 ]

    robert engels commented on LUCENE-1476:
    ---------------------------------------

    I don't think you can change this...

    In many cases after you have read an index, and retrieved document numbers, these are lazily returned to the client.

    By the time some records are needed to be read, they may have already been deleted (at least this was the usage in old lucene, where deletions happened in the reader).

    I think a lot of code assumes this, and calls the isDeleted() to ensure the document is still valid.
  • Marvin Humphrey (JIRA) at Dec 5, 2008 at 6:23 pm
    [ https://issues.apache.org/jira/browse/LUCENE-1476?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12653883#action_12653883 ]

    Marvin Humphrey commented on LUCENE-1476:
    -----------------------------------------

    bq. Marvin, in KS/Lucy are you using random-access or iterator to access deletedDocs & norms?

    Both. There's a DelEnum class which is used by NOTScorer and MatchAllScorer, but it's implemented using BitVectors which get the next deleted doc num by calling nextSetBit() internally.

    I happened to be coding up those classes this spring when there was the big brouhaha about IndexReader.isDeleted(). It seemed wrong to pay the method call overhead for IndexReader.isDeleted() on each iter in NOTScorer.next() or MatchAllScorer.next(), when we could just store the next deletion:

    {code}
    i32_t
    MatchAllScorer_next(MatchAllScorer* self)
    {
        do {
            if (++self->doc_num > self->max_docs) {
                self->doc_num--;
                return 0;
            }
            if (self->doc_num > self->next_deletion) {
                self->next_deletion
                    = DelEnum_Skip_To(self->del_enum, self->doc_num);
            }
        } while (self->doc_num == self->next_deletion);
        return self->doc_num;
    }
    {code}

    (Note: Scorer.next() in KS returns the document number; doc nums start at 1, and 0 is the sentinel signaling iterator termination. I expect that Lucy will be the same.)

    Perhaps we could get away without needing the random access, but that's because IndexReader.isDeleted() isn't exposed and because IndexReader.fetchDoc(int docNum) returns the doc even if it's deleted -- unlike Lucene which throws an exception. Also, you can't delete documents against an IndexReader, so Robert's objection doesn't apply to us.

    I had always assumed we were going to have to expose isDeleted() eventually, but maybe we can get away with zapping it. Interesting!

    I've actually been trying to figure out a new design for deletions because writing them out for big segments is our last big write bottleneck, now that we've theoretically solved the sort cache warming issue. I figured we would continue to need bit-vector files because they're straightforward to mmap, but if we only need iterator access, we can use vbyte encoding instead... Hmm, we still face the problem of outsized write cost when a segment has a large number of deletions and you add one more...
  • Michael McCandless (JIRA) at Dec 5, 2008 at 6:45 pm
    [ https://issues.apache.org/jira/browse/LUCENE-1476?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12653889#action_12653889 ]

    Michael McCandless commented on LUCENE-1476:
    --------------------------------------------


    bq. It seemed wrong to pay the method call overhead for IndexReader.isDeleted() on each iter in NOTScorer.next() or MatchAllScorer.next(), when we could just store the next deletion:

    Nice! This is what I had in mind.

    I think we could [almost] do this across the board for Lucene.
    SegmentTermDocs would similarly store nextDeleted and apply the same
    "AND NOT" logic.

    bq. that's because IndexReader.isDeleted() isn't exposed and because IndexReader.fetchDoc(int docNum) returns the doc even if it's deleted

    Hmm -- that is very nicely enabling.

    bq. I've actually been trying to figure out a new design for deletions because writing them out for big segments is our last big write bottleneck

    One approach would be to use a "segmented" model. IE, if a few
    deletions are added, write that to a new "deletes segment", ie a
    single "normal segment" would then have multiple deletion files
    associated with it. These would have to be merged (iterator) when
    used during searching, and, periodically coalesced.
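
A rough Java sketch of the merge step for such a segmented model, assuming each "deletes segment" is held as a sorted `int[]` of deleted doc IDs. `DeletesMerger` and `merge` are hypothetical names. The sketch coalesces eagerly into one list; the same heap-based merge could be exposed lazily as an iterator at search time.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.PriorityQueue;

final class DeletesMerger {
    // Merges several ascending deleted-docID lists into one ascending list,
    // dropping docIDs that appear in more than one list.
    static List<Integer> merge(List<int[]> deleteSegments) {
        // Queue entries are {docId, listIndex, offsetInList}, ordered by docId.
        PriorityQueue<int[]> queue =
            new PriorityQueue<>((a, b) -> Integer.compare(a[0], b[0]));
        for (int i = 0; i < deleteSegments.size(); i++) {
            int[] seg = deleteSegments.get(i);
            if (seg.length > 0) {
                queue.add(new int[] {seg[0], i, 0});
            }
        }

        List<Integer> merged = new ArrayList<>();
        while (!queue.isEmpty()) {
            int[] head = queue.poll();
            // Skip a docID already emitted from another deletes segment.
            if (merged.isEmpty() || merged.get(merged.size() - 1) != head[0]) {
                merged.add(head[0]);
            }
            // Refill the queue from the list the head came from.
            int[] seg = deleteSegments.get(head[1]);
            int next = head[2] + 1;
            if (next < seg.length) {
                queue.add(new int[] {seg[next], head[1], next});
            }
        }
        return merged;
    }
}
```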

    bq. if we only need iterator access, we can use vbyte encoding instead

    Right: if there are relatively few deletes against a segment, encoding
    the "on bits" directly (or deltas) should be a decent win since
    iteration is much faster.
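
To make the "encode the on bits as deltas" idea concrete, here is a minimal Java sketch of delta plus variable-byte encoding for a sorted list of deleted doc IDs. `DeltaVByteDeletes`, `encode` and `decode` are hypothetical names, and the byte layout (7 payload bits per byte, high bit as continuation flag) is one common vbyte convention rather than a format defined in this thread. Doc IDs are assumed to be non-negative and ascending.

```java
import java.io.ByteArrayOutputStream;

final class DeltaVByteDeletes {
    // Encodes ascending docIDs as vbyte-compressed gaps between consecutive IDs.
    static byte[] encode(int[] sortedDocIds) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        int prev = 0;
        for (int doc : sortedDocIds) {
            int delta = doc - prev;
            // Emit 7 bits at a time, low bits first; the high bit means "more bytes follow".
            while ((delta & ~0x7F) != 0) {
                out.write((delta & 0x7F) | 0x80);
                delta >>>= 7;
            }
            out.write(delta);
            prev = doc;
        }
        return out.toByteArray();
    }

    // Decodes `count` docIDs; supports forward iteration only, not random access.
    static int[] decode(byte[] bytes, int count) {
        int[] docs = new int[count];
        int pos = 0;
        int prev = 0;
        for (int i = 0; i < count; i++) {
            int delta = 0;
            int shift = 0;
            byte b;
            do {
                b = bytes[pos++];
                delta |= (b & 0x7F) << shift;
                shift += 7;
            } while ((b & 0x80) != 0);
            prev += delta;
            docs[i] = prev;
        }
        return docs;
    }
}
```

The trade-off matches the discussion: the encoded form is small when deletions are sparse, but answering "is doc N deleted?" requires scanning from the start, so it only fits iterator-style access.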

  • Michael McCandless (JIRA) at Dec 5, 2008 at 6:49 pm
    [ https://issues.apache.org/jira/browse/LUCENE-1476?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12653891#action_12653891 ]

    Michael McCandless commented on LUCENE-1476:
    --------------------------------------------


    {quote}
    In many cases after you have read an index, and retrieved document numbers, these are lazily returned to the client.

    By the time some records are needed to be read, they may have already been deleted (at least this was the usage in old lucene, where deletions happened in the reader).

    I think a lot of code assumes this, and calls the isDeleted() to ensure the document is still valid.
    {quote}

    But isn't that an uncommon use case? It's dangerous to wait a long
    time after getting a docID from a reader, before looking up the
    document. Most apps pull the doc right away, send it to the user, and
    the docID isn't kept (I think?).

    But still I agree: we can't eliminate random access to isDeleted
    entirely. We'd still have to offer it for such external cases.

    I'm just saying the internal uses of isDeleted could all be switched
    to iteration instead, and, we might get some performance gains from
    it especially when the number of deletes on a segment is relatively low.

  • robert engels (JIRA) at Dec 5, 2008 at 7:57 pm
    [ https://issues.apache.org/jira/browse/LUCENE-1476?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12653912#action_12653912 ]

    robert engels commented on LUCENE-1476:
    ---------------------------------------

    But IndexReader.document(n) throws an exception if the document is deleted... so you still need random access.
  • Michael McCandless (JIRA) at Dec 5, 2008 at 8:03 pm
    [ https://issues.apache.org/jira/browse/LUCENE-1476?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12653915#action_12653915 ]

    Michael McCandless commented on LUCENE-1476:
    --------------------------------------------

    bq. but IndexReader.document throws an exception if the document is deleted... so you still need random access

    Does it really need to throw an exception? (Of course for back compat it does, but we could move away from that to a new method that doesn't check).
  • Jason Rutherglen (JIRA) at Dec 5, 2008 at 8:53 pm
    [ https://issues.apache.org/jira/browse/LUCENE-1476?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12653923#action_12653923 ]

    Jason Rutherglen commented on LUCENE-1476:
    ------------------------------------------

    It would be great if instead of relying on Lucene to manage the deletedDocs file, the API would be pluggable enough that a DocIdBitSet (a DocIdSet with random access) could be set in a SegmentReader, and the file access (reading and writing) could be managed from outside. Of course this is difficult to do while keeping things backwards compatible. However, for 3.0 I would *really* like to be a part of efforts to create a completely generic and pluggable API that is cleanly separated from the underlying index format and files. This would mean that the analyzing, querying, and scoring portions of Lucene could access an IndexReader-like pluggable class, where the underlying index files, and when and how they are written to disk, are completely separated.

    One motivation for this patch is to allow custom queries access to the deletedDocs in a clean way (meaning not needing to be a part of the o.a.l.s. package).

    I am wondering if it is good to try to get IndexReader.clone working again, or if there is some other better way related to this patch to externally manage the deletedDocs.

    Improving the performance of deletedDocs would help for every query so it's worth looking at.
  • Marvin Humphrey (JIRA) at Dec 5, 2008 at 9:56 pm
    [ https://issues.apache.org/jira/browse/LUCENE-1476?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12653939#action_12653939 ]

    Marvin Humphrey commented on LUCENE-1476:
    -----------------------------------------

    bq. Does it really need to throw an exception?

    Aside from back compat, I don't see why it would need to. I think the only rationale is to serve as a backstop protecting against invalid reads.
  • robert engels (JIRA) at Dec 5, 2008 at 10:46 pm
    [ https://issues.apache.org/jira/browse/LUCENE-1476?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12653954#action_12653954 ]

    robert engels commented on LUCENE-1476:
    ---------------------------------------

    That's my point: in complex multi-threaded software with multiple readers, etc., it is a good backstop against errors. :)
  • Marvin Humphrey (JIRA) at Dec 5, 2008 at 10:48 pm
    [ https://issues.apache.org/jira/browse/LUCENE-1476?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12653959#action_12653959 ]

    Marvin Humphrey commented on LUCENE-1476:
    -----------------------------------------

    bq. It would be great if instead of relying on Lucene to manage the deletedDocs file, the API would be pluggable

    In LUCENE-1478, "IndexComponent" was proposed, with potential subclasses including PostingsComponent, LexiconComponent/TermDictComponent, TermVectorsComponent, and so on. Since then, it has become apparent that SnapshotComponent and DeletionsComponent also belong at the top level.

    In Lucy/KS, these would all be specified within a Schema:

    {code}
    class MySchema extends Schema {
        DeletionsComponent deletionsComponent() {
            return new DocIdBitSetDeletionsComponent();
        }

        void initFields() {
            addField("title", "text");
            addField("content", "text");
        }

        Analyzer analyzer() {
            return new PolyAnalyzer("en");
        }
    }
    {code}

    Mike, you were planning on managing IndexComponents via IndexReader and IndexWriter constructor args, but won't that get unwieldy if there are too many components? A Schema class allows you to group them together. You don't have to use it to manage fields the way KS does -- just leave that out.
  • Michael McCandless (JIRA) at Dec 6, 2008 at 10:20 am
    [ https://issues.apache.org/jira/browse/LUCENE-1476?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12654047#action_12654047 ]

    Michael McCandless commented on LUCENE-1476:
    --------------------------------------------

    bq. Mike, you were planning on managing IndexComponents via IndexReader and IndexWriter constructor args, but won't that get unwieldy if there are too many components? A Schema class allows you to group them together. You don't have to use it to manage fields the way KS does - just leave that out.

    Agreed. I'll try to do something along these lines under LUCENE-1458.
  • Marvin Humphrey (JIRA) at Dec 8, 2008 at 12:58 am
    [ https://issues.apache.org/jira/browse/LUCENE-1476?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12654269#action_12654269 ]

    Marvin Humphrey commented on LUCENE-1476:
    -----------------------------------------
    One approach would be to use a "segmented" model.
    That would improve the average performance of deleting a document, at the cost
    of some added complexity. Worst-case performance -- which you'd hit when you
    consolidated those sub-segment deletions files -- would actually degrade a
    bit.

    To manage consolidation, you'd need a deletions merge policy that operated
    independently from the primary merge policy. Aside from the complexity penalty,
    having two un-coordinated merge policies would be bad for real-time search,
    because you want to be able to control exactly when you pay for a big merge.

    I'm also bothered by the proliferation of small deletions files. Probably
    you'd want automatic consolidation of files under 4k, but you still could end
    up with a lot of files in a big index.

    So... what if we wrote, merged, and removed deletions files on the same
    schedule as ordinary segment files? Instead of going back and quasi-modifying
    an existing segment by associating a next-generation .del file with it, we write
    deletions to a NEW segment and have them reference older segments.

    In other words, we add "tombstones" rather than "delete" documents.

    Logically speaking, each tombstone segment file would consist of an array of
    segment identifiers, each of which would point to a "tombstone row" array of
    vbyte-encoded doc nums:

    {code}
    // _6.tombstone
    _2: [3, 4, 25]
    _3: [13]

    // _7.tombstone
    _2: [5]

    // _8.tombstone
    _1: [94]
    _2: [7, 8]
    _5: [54, 55]
    {code}
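
    For illustration, here is a minimal sketch of how one tombstone row might be vbyte-encoded. It assumes the common layout of seven payload bits per byte, low-order bits first, with the high bit marking continuation (the same scheme as Lucene's VInt). The VByteRow class and its methods are hypothetical names, not part of Lucene or Lucy/KS, and doc nums are assumed to be non-negative.

```java
import java.io.ByteArrayOutputStream;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: encodes and decodes one "tombstone row" of doc nums.
final class VByteRow {
    // Write each doc num as 7-bit groups, low-order group first.
    // The high bit is set on every byte except the last byte of a value.
    static byte[] encode(int[] docNums) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (int doc : docNums) {
            int v = doc;
            while ((v & ~0x7F) != 0) {
                out.write((v & 0x7F) | 0x80);
                v >>>= 7;
            }
            out.write(v);
        }
        return out.toByteArray();
    }

    // Reverse of encode(): read values until the byte array is exhausted.
    static int[] decode(byte[] bytes) {
        List<Integer> docs = new ArrayList<>();
        int pos = 0;
        while (pos < bytes.length) {
            int b = bytes[pos++];
            int v = b & 0x7F;
            for (int shift = 7; (b & 0x80) != 0; shift += 7) {
                b = bytes[pos++];
                v |= (b & 0x7F) << shift;
            }
            docs.add(v);
        }
        int[] result = new int[docs.size()];
        for (int i = 0; i < result.length; i++) {
            result[i] = docs.get(i);
        }
        return result;
    }
}
```

    Under this layout the row [3, 4, 25] takes three bytes, one per doc num, and larger doc nums cost one more byte for each additional seven bits.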

    The thing that makes this possible is that the dead docs marked by tombstones
    never get their doc nums shuffled during segment merging -- they just
    disappear. If deleted docs lived to be consolidated into new segments and
    acquire new doc nums, tombstones wouldn't work. However, we can associate
    tombstone rows with segment names and they only need remain valid as long
    as the segments they reference survive.

    Some tombstone rows will become obsolete once the segments they reference go
    away, but we never arrive at a scenario where we are forced to discard valid
    tombstones. Merging tombstone files simply involves dropping obsolete
    tombstone rows and collating valid ones.
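
    The merge step described above can be sketched as follows. This assumes a tombstone file can be modeled in memory as a map from segment name to an array of doc nums; TombstoneMerger and its signature are hypothetical, for illustration only.

```java
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

// Hypothetical sketch: merge several tombstone files into one.
final class TombstoneMerger {
    // files: each map is one tombstone file (segment name -> doc nums).
    // liveSegments: names of the segments that still exist after merging.
    static Map<String, int[]> merge(List<Map<String, int[]>> files,
                                    Set<String> liveSegments) {
        Map<String, TreeSet<Integer>> collated = new TreeMap<>();
        for (Map<String, int[]> file : files) {
            for (Map.Entry<String, int[]> row : file.entrySet()) {
                // Drop rows whose target segment has gone away.
                if (!liveSegments.contains(row.getKey())) {
                    continue;
                }
                // Collate rows for the same segment into one sorted set.
                TreeSet<Integer> docs =
                    collated.computeIfAbsent(row.getKey(), k -> new TreeSet<>());
                for (int doc : row.getValue()) {
                    docs.add(doc);
                }
            }
        }
        Map<String, int[]> merged = new TreeMap<>();
        for (Map.Entry<String, TreeSet<Integer>> e : collated.entrySet()) {
            int[] row = e.getValue().stream().mapToInt(Integer::intValue).toArray();
            merged.put(e.getKey(), row);
        }
        return merged;
    }
}
```

    Merging the three example files above while only _2 and _5 survive would yield _2: [3, 4, 5, 7, 8, 25] and _5: [54, 55]; the rows for _1 and _3 are dropped as obsolete.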

    At search time, we'd use an iterator with an internal priority queue to
    collate tombstone rows into a stream -- so there's still no need to slurp the
    files at IndexReader startup.
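
    A rough sketch of the search-time collation, assuming each tombstone row for a segment is available as a sorted array of non-negative doc nums (in practice the rows would presumably be decoded lazily from disk). TombstoneIterator, Cursor, and the NO_MORE_DOCS sentinel are hypothetical names chosen for this example.

```java
import java.util.List;
import java.util.PriorityQueue;

// Hypothetical sketch: stream the deleted doc nums for one segment by
// collating several sorted tombstone rows through a priority queue.
final class TombstoneIterator {
    // Position within a single sorted tombstone row.
    private static final class Cursor {
        final int[] row;
        int pos;

        Cursor(int[] row) {
            this.row = row;
        }

        int current() {
            return row[pos];
        }
    }

    // Sentinel returned once every row is exhausted (an assumption of this sketch).
    static final int NO_MORE_DOCS = Integer.MAX_VALUE;

    // Ordered by each cursor's current doc num, smallest first.
    private final PriorityQueue<Cursor> queue =
        new PriorityQueue<>((a, b) -> Integer.compare(a.current(), b.current()));
    private int lastDoc = -1;

    TombstoneIterator(List<int[]> rows) {
        for (int[] row : rows) {
            if (row.length > 0) {
                queue.add(new Cursor(row));
            }
        }
    }

    // Returns the next deleted doc num in ascending order, skipping duplicates.
    int nextDoc() {
        while (!queue.isEmpty()) {
            Cursor top = queue.poll();
            int doc = top.current();
            if (++top.pos < top.row.length) {
                queue.add(top); // re-insert with its next doc num
            }
            if (doc != lastDoc) {
                lastDoc = doc;
                return doc;
            }
        }
        return NO_MORE_DOCS;
    }
}
```

    For segment _2 in the example above, the rows [3, 4, 25], [5], and [7, 8] would stream out as 3, 4, 5, 7, 8, 25. Only one cursor per row is held in memory, which is what avoids slurping whole files up front.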
  • Michael McCandless (JIRA) at Dec 8, 2008 at 9:07 pm
    [ https://issues.apache.org/jira/browse/LUCENE-1476?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12654571#action_12654571 ]

    Michael McCandless commented on LUCENE-1476:
    --------------------------------------------


    I like this approach!!

    It's also incremental in cost (cost of flush/commit is in proportion
    to how many deletes were done), but you are storing the "packet" of
    incremental deletes with the segment you just flushed and not against
    the N segments that had deletes. And you write only one file to hold
    all the tombstones, which for commit() (file sync) is much less cost.

    And it's great that we don't need a new merge policy to handle all the
    delete files.

    Though one possible downside is, for a very large segment in a very
    large index you will likely be merging (at search time) quite a few
    delete packets. But, with the cutover to
    deletes-accessed-only-by-iterator, this cost is probably not high
    until a large percentage of the segment's docs are deleted, at which point
    you should really expungeDeletes() or optimize() or optimize(int)
    anyway.

    If only we could write code as quickly as we can dream...

  • Jason Rutherglen (JIRA) at Dec 8, 2008 at 10:00 pm
    [ https://issues.apache.org/jira/browse/LUCENE-1476?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12654592#action_12654592 ]

    Jason Rutherglen commented on LUCENE-1476:
    ------------------------------------------

    Marvin:
    "I'm also bothered by the proliferation of small deletions files. Probably
    you'd want automatic consolidation of files under 4k, but you still could end
    up with a lot of files in a big index."

    A transaction log might be better here if we want to get to near-zero-millisecond realtime.
    On Windows, at least, rapidly creating and deleting files incurs significant IO overhead.
    UNIX is probably faster, but I do not know.


  • Jason Rutherglen (JIRA) at Dec 8, 2008 at 10:08 pm
    [ https://issues.apache.org/jira/browse/LUCENE-1476?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12654595#action_12654595 ]

    Jason Rutherglen commented on LUCENE-1476:
    ------------------------------------------

    Wouldn't it be good to remove BitVector and replace it with OpenBitSet? OpenBitSet is faster and already has a DocIdSetIterator. It just needs to implement compressed writing of the bitset to disk (d-gaps?). This would be a big win for almost *all* searches. We could also create an interface so that any bitset implementation could be used.

    Such as:
    {code}
    import java.io.IOException;
    import org.apache.lucene.store.IndexOutput;

    public interface WriteableBitSet {
        public void write(IndexOutput output) throws IOException;
    }
    {code}
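
    As a sketch of what d-gap compression could look like, the following converts a bitset into the gaps between consecutive set bits, so that a sparse bitset can be stored as a short list of small integers rather than one bit per document. It uses java.util.BitSet as a stand-in for OpenBitSet, and DGapCodec is a hypothetical name; this is one reading of "dgaps", not necessarily the format Lucene writes.

```java
import java.util.BitSet;

// Hypothetical sketch: convert between a bitset and its d-gap representation.
final class DGapCodec {
    // The first entry is the index of the first set bit; every later entry
    // is the distance from the previous set bit to the next one.
    static int[] toGaps(BitSet bits) {
        int[] gaps = new int[bits.cardinality()];
        int prev = 0;
        int n = 0;
        for (int i = bits.nextSetBit(0); i >= 0; i = bits.nextSetBit(i + 1)) {
            gaps[n++] = i - prev;
            prev = i;
        }
        return gaps;
    }

    // Rebuild the bitset by accumulating the gaps back into absolute positions.
    static BitSet fromGaps(int[] gaps) {
        BitSet bits = new BitSet();
        int pos = 0;
        for (int gap : gaps) {
            pos += gap;
            bits.set(pos);
        }
        return bits;
    }
}
```

    A bitset with bits 3, 4, 25, and 1000 set becomes the gaps [3, 1, 21, 975]. An implementation of write() could then emit each gap with a variable-length integer encoding.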
