[ https://issues.apache.org/jira/browse/LUCENE-1476?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12661786#action_12661786 ]

Marvin Humphrey commented on LUCENE-1476:
-----------------------------------------

While converting PostingList (the Lucy/KS analogue to TermDocs) over to use a deletions iterator, it occurred to me that because the iterator has to keep state, a unique DelEnum object has to be created for every PostingList. In contrast, a BitVector object, which is accessed only via get(), can be shared.

It bugs me that each PostingList will have its own DelEnum performing interleaving of tombstones. With very large queries against indexes with large numbers of deletions, it seems like a lot of duplicated work.

To minimize CPU cycles, it would theoretically make more sense to handle deletions much higher up, at the top-level Scorer, Searcher, or even the HitCollector level. PostingList would be completely ignorant of deletions, as would classes like NOTScorer and MatchAllScorer:

{code}
i32_t
MatchAllScorer_next(MatchAllScorer* self)
{
    if (++self->doc_num > self->max_docs) {
        self->doc_num--;
        return 0;
    }
    return self->doc_num;
}
{code}

{code}
void
Scorer_collect(Scorer *self, HitCollector *collector, DelEnum *del_enum)
{
    i32_t next_deletion = del_enum ? 0 : I32_MAX;
    i32_t doc_num       = 1;

    while (1) {
        /* Advance doc_num past any deleted docs. */
        while (doc_num >= next_deletion) {
            next_deletion = DelEnum_Skip_To(del_enum, doc_num);
            while (doc_num == next_deletion) {
                doc_num++;
                next_deletion = DelEnum_Skip_To(del_enum, doc_num);
            }
        }

        /* Find the next match at or after doc_num. */
        doc_num = Scorer_Skip_To(self, doc_num);
        if (!doc_num) {
            break;
        }
        else if (doc_num < next_deletion) {
            /* Not deleted: collect, then resume after this doc. */
            HC_Collect(collector, doc_num, Scorer_Tally(self));
            doc_num++;
        }
    }
}
{code}

The problem is that PostingLists spun off from indexReader.postingList(field, term) would include deleted documents, as would Scorers.

I suppose you could band-aid indexReader.postingList() by adding a boolean suppressDeletions argument which would default to true, but that seems messy.

Jason, I think the inefficiency of needing individual iterator objects applies to OpenBitSet as well, right? As soon as we do away with random access, we have to keep state. Dunno if it's going to be noticeable, but it's conceptually annoying.
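
Roughly, the contrast looks like this; a minimal Java sketch with invented classes (SharedDeletions and DeletionsIterator are illustration only, not Lucene or Lucy APIs):

{code}
import java.util.BitSet;

// Random access: one shared deletions object, no per-consumer state.
class SharedDeletions {
    private final BitSet bits = new BitSet();
    boolean isDeleted(int docNum) { return bits.get(docNum); }
}

// Iterator access: every consumer needs its own cursor.
class DeletionsIterator {
    private final BitSet bits;   // the underlying bits can be shared...
    private int cursor = -1;     // ...but this cursor cannot
    DeletionsIterator(BitSet bits) { this.bits = bits; }
    int next() { cursor = bits.nextSetBit(cursor + 1); return cursor; }
}
{code}
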
BitVector implement DocIdSet
----------------------------

Key: LUCENE-1476
URL: https://issues.apache.org/jira/browse/LUCENE-1476
Project: Lucene - Java
Issue Type: Improvement
Components: Index
Affects Versions: 2.4
Reporter: Jason Rutherglen
Priority: Trivial
Attachments: LUCENE-1476.patch

Original Estimate: 12h
Remaining Estimate: 12h

BitVector can implement DocIdSet. This is for making SegmentReader.deletedDocs pluggable.


  • Jason Rutherglen (JIRA) at Jan 8, 2009 at 1:47 am
    [ https://issues.apache.org/jira/browse/LUCENE-1476?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12661814#action_12661814 ]

    Jason Rutherglen commented on LUCENE-1476:
    ------------------------------------------

    I like this idea of tombstones, and we should figure out a way to support it. Issue LUCENE-584 (https://issues.apache.org/jira/browse/LUCENE-584) had an implementation of a "matcher" class that the scorers implemented, which I do not think made it into the committed patch.
    bq. I think the inefficiency of needing individual iterator objects applies to OpenBitSet as well, right?
    Yes, that is true.

    For realtime search, where a new transaction may only have a handful of deletes, tombstones may not be optimal because too many tombstones would accumulate (I believe). For this scenario, rolling bitsets may be better: pool the bit sets and throw away unused readers.
  • Marvin Humphrey (JIRA) at Jan 8, 2009 at 2:39 am
    [ https://issues.apache.org/jira/browse/LUCENE-1476?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12661829#action_12661829 ]

    Marvin Humphrey commented on LUCENE-1476:
    -----------------------------------------

    Jason Rutherglen:
    bq. For realtime search, where a new transaction may only have a handful of deletes, tombstones may not be optimal
    The whole tombstone idea arose out of the need for (close to) realtime search! It's intended to improve write speed.

    When you make deletes with the BitSet model, you have to rewrite files that scale with segment size, regardless of how few deletions you make. Deletion of a single document in a large segment may necessitate writing out a substantial bit vector file.

    In contrast, i/o throughput for writing out a tombstone file scales with the number of tombstones.
    bq. because too many tombstones would accumulate (I believe).
    Say that you make a string of commits that are nothing but deleting a single document -- thus adding a new segment each time that contains nothing but a single tombstone. Those are going to be cheap to merge, so it seems unlikely that we'll end up with an unwieldy number of tombstone streams to interleave at search-time.

    The more likely problem is the one McCandless articulated regarding a large segment accumulating a lot of tombstone streams against it. But I agree with him that it only gets truly serious if your merge policy neglects such segments and allows them to deteriorate for too long.
    bq. For this scenario, rolling bitsets may be better: pool the bit sets and throw away unused readers.
    I don't think I understand. Is this the "combination index reader/writer" model, where the writer prepares a data structure that then gets handed off to the reader?
  • Robert engels at Jan 8, 2009 at 3:28 am
    Why not just write the first byte as 0 for a bit set and 1 for a
    sparse (compressed) bit set, and make the determination when writing
    based on the segment size and/or the number of set bits?
  • Marvin Humphrey at Jan 8, 2009 at 4:29 am

    On Wed, Jan 07, 2009 at 09:28:40PM -0600, robert engels wrote:
    > Why not just write the first byte as 0 for a bit set and 1 for a
    > sparse (compressed) bit set, and make the determination when writing
    > based on the segment size and/or the number of set bits?
    Are you offering that as a solution to the problem I described here?
    > When you make deletes with the BitSet model, you have to rewrite
    > files that scale with segment size, regardless of how few deletions
    > you make. Deletion of a single document in a large segment may
    > necessitate writing out a substantial bit vector file.
    >
    > In contrast, i/o throughput for writing out a tombstone file scales
    > with the number of tombstones.
    Worst-case i/o costs don't improve under such a regime. You could still end
    up writing a large, uncompressed bit vector file to accommodate a single
    deletion.

    I suppose that has to be weighed against the search-time costs of interleaving
    the tombstone streams. We can either pay the interleaving penalty at
    index-time or search-time. It's annoying to write out a 1 MB uncompressed bit
    vector file for a single deleted doc against an 8-million doc segment, but if
    there are enough deletions to justify an uncompressed file, iterating through
    them via merged-on-the-fly tombstone streams would be annoying too.

    Marvin Humphrey


  • Robert engels at Jan 8, 2009 at 4:36 am
    Yes, and I don't think the "worst-case" is correct.

    When you go to write that segment and determine that it is a "large"
    segment with few deletions (one in this case), it will be written
    compressed in probably less than 10 bytes (1-byte header, vlong
    start, vint length; you only write the ones...).

    You would not write the file as uncompressed in this case. The only
    time you would write an uncompressed bitset is when you determine it
    is large with many deletions, or it is very small (more efficient to
    just write the bytes quickly). The very small case should probably
    be smaller than a "standard" disk block (which is probably 32k
    these days, meaning 256k documents).
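
    A rough sketch of that selection logic against Lucene's IndexOutput: the header-byte values, the threshold, and the delta-encoded sparse layout (a variant of the vlong/vint run encoding described above) are illustrative assumptions, not an actual Lucene file format.

    {code}
    import java.io.IOException;
    import java.util.BitSet;
    import org.apache.lucene.store.IndexOutput;

    class DeletionsWriter {
        // Pick a representation at write time. Header-byte values, sparse
        // layout, and the ~5% threshold are invented for this sketch.
        static void writeDeletions(IndexOutput out, BitSet dels, int maxDoc)
                throws IOException {
            int count = dels.cardinality();
            if (count * 20 < maxDoc) {          // few deletions: write sparse
                out.writeByte((byte) 1);
                out.writeVInt(count);
                int prev = 0;
                for (int doc = dels.nextSetBit(0); doc >= 0;
                         doc = dels.nextSetBit(doc + 1)) {
                    out.writeVInt(doc - prev);  // delta-encode the set bits
                    prev = doc;
                }
            } else {                            // dense: dump the raw bits
                out.writeByte((byte) 0);
                byte[] bytes = new byte[(maxDoc + 7) / 8];
                for (int doc = dels.nextSetBit(0); doc >= 0;
                         doc = dels.nextSetBit(doc + 1)) {
                    bytes[doc >> 3] |= (byte) (1 << (doc & 7));
                }
                out.writeBytes(bytes, bytes.length);
            }
        }
    }
    {code}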
  • Marvin Humphrey at Jan 8, 2009 at 6:04 am

    On Wed, Jan 07, 2009 at 10:36:01PM -0600, robert engels wrote:
    > Yes, and I don't think the "worst-case" is correct.
    >
    > When you go to write that segment and determine that it is a "large"
    > segment with few deletions (one in this case), it will be written
    > compressed in probably less than 10 bytes (1-byte header, vlong
    > start, vint length; you only write the ones...).
    If a segment slowly accumulates deletions over time during different indexing
    sessions, at some point you will cross the threshold where the deletions file
    needs to be written out as an uncompressed bit vector. From then on, adding a
    single additional deletion to the segment during any subsequent indexing
    session triggers what I'm calling "worst-case" behavior: the whole bit vector
    file needs to be rewritten for the sake of a single deletion.

    Marvin Humphrey

  • Robert engels at Jan 8, 2009 at 6:18 am
    Then why not always write segment.delXXXX, where XXXX is
    incremented? Each file may be compressed or uncompressed based on
    the number of deletions it contains.

    It should not matter in high performance "server mode" cases, since
    the deletion data for the segment should be in memory anyway, so no
    need to read all of the files (during merging). They are only there
    for crash recovery situations.

    If you are using Lucene in the "shared networking" mode, then
    performance is probably not a huge concern anyway, so reading the
    multiple files when opening the segment is the penalty.

    Or the policy can be pluggable, and the "shared" mode can use the old
    bitset method.
  • Michael McCandless at Jan 8, 2009 at 10:35 am

    robert engels wrote:

    > Then why not always write segment.delXXXX, where XXXX is incremented?
    This is what Lucene does today. It's "write once".
    > Each file may be compressed or uncompressed based on the number of
    > deletions it contains.
    Lucene also does this.

    Still, as Marvin pointed out, the cost of committing a delete is in
    proportion to either the number of deletes already on the segment (if
    written sparse) or the number of documents in the segment (if written
    non-sparse). It doesn't scale well... though the constant factor may
    be very small (ie may not matter that much in practice?). With
    tombstones the commit cost would be in proportion to how many deletes
    you did (scales perfectly), at the expense of added per-search cost
    and search iterator state.

    For realtime search this could be a good tradeoff to make (lower
    latency on add/delete -> refreshed searcher, at higher per-search
    cost), but... in the realtime search discussion we are now thinking
    that the deletes live with the reader and are carried in RAM over to
    the reopened reader (LUCENE-1314), bypassing having to commit to the
    filesystem at all.

    One downside to this is it's single-JRE only, ie to do distributed
    realtime search you'd have to also re-apply the deletes to the head
    IndexReader on each JRE. (Whereas added docs would be written with a
    single IndexWriter, and propagated via the filesystem).

    If we go forward with this model then indeed slowish commit times for
    new deletes are less important since it's for crash recovery and not
    for opening a new reader.

    But we'd have many "control" issues to work through... eg how the
    reader can re-open against old segments right after a new merge is
    committed (because the newly merged segment isn't warmed yet), and
    how IndexReader can open segments written by the writer but not truly
    committed (sync'd).

    Mike

  • Paul Elschot (JIRA) at Jan 8, 2009 at 7:52 am
    [ https://issues.apache.org/jira/browse/LUCENE-1476?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12661880#action_12661880 ]

    Paul Elschot commented on LUCENE-1476:
    --------------------------------------

    Jason,

    bq. This issue LUCENE-584 had an implementation of a "matcher" class that the scorers implemented which I do not think made it into the committed patch.

    The only functional difference between the DocIdSetIterator as committed and the Matcher class at 584 is the Matcher.explain() method, which did not make it into DocIdSetIterator.
  • Michael McCandless (JIRA) at Jan 8, 2009 at 10:59 am
    [ https://issues.apache.org/jira/browse/LUCENE-1476?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12661934#action_12661934 ]

    Michael McCandless commented on LUCENE-1476:
    --------------------------------------------


    {quote}
    PostingList would be completely ignorant of deletions, as would classes like NOTScorer and MatchAllScorer:
    {quote}

    This is a neat idea! Deletions are then applied just like a Filter.

    For a TermQuery (one term) the cost of the two approaches should be
    the same.

    For OR'd Term queries, it actually seems like your proposed approach
    may be lower cost? Ie rather than each TermEnum doing the "AND NOT
    deleted" intersection, you only do it once at the top. There is added
    cost in that each TermEnum is now returning more docIDs than before,
    but the deleted ones are eliminated before scoring.

    For AND (and other) queries I'm not sure. In theory, having to
    process more docIDs is more costly, eg a PhraseQuery or SpanXXXQuery
    may see much higher net cost. We should test.

    Conceivably, a future "search optimization phase" could pick & choose
    the best point to inject the "AND NOT deleted" filter. In fact, it
    could also pick when to inject a Filter... a costly per-docID search
    with a very restrictive filter could be far more efficient if you
    applied the Filter earlier in the chain.

    I'm also curious what cost you see of doing the merge sort for every
    search; I think it could be uncomfortably high since it's so
    hard-for-cpu-to-predict-branch-intensive. We could take the first
    search that doesn't use skipTo and save the result of the merge sort,
    essentially doing an in-RAM-only "merge" of those deletes, and let
    subsequent searches use that single merged stream. (This is not MMAP
    friendly, though).
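
    For illustration, that one-time merge might look like the following, assuming each tombstone stream has already been decoded into a sorted int[] (TombstoneMerger and the int[] inputs are invented for this sketch, not a Lucene API):

    {code}
    import java.util.Comparator;
    import java.util.List;
    import java.util.PriorityQueue;

    class TombstoneMerger {
        // K-way merge of sorted deletion streams into one sorted array, so
        // subsequent searches iterate a single stream instead of re-merging.
        static int[] merge(List<int[]> streams) {
            // Queue entries are {docID, streamIndex, offsetWithinStream}.
            PriorityQueue<int[]> pq =
                new PriorityQueue<int[]>(Comparator.comparingInt(e -> e[0]));
            int total = 0;
            for (int i = 0; i < streams.size(); i++) {
                total += streams.get(i).length;
                if (streams.get(i).length > 0) {
                    pq.add(new int[] { streams.get(i)[0], i, 0 });
                }
            }
            int[] merged = new int[total];
            int n = 0;
            while (!pq.isEmpty()) {
                int[] top = pq.poll();
                merged[n++] = top[0];  // duplicates possible if streams overlap
                int[] src = streams.get(top[1]);
                if (top[2] + 1 < src.length) {
                    pq.add(new int[] { src[top[2] + 1], top[1], top[2] + 1 });
                }
            }
            return merged;
        }
    }
    {code}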

    In my initial rough testing, I switched to an iterator API for
    SegmentTermEnum and found that if the percentage of deletes is < 10%,
    the search was a bit faster using an iterator vs random access, but
    above that it was slower. This was with an already "merged" list of
    in-order docIDs.

    Switching to an iterator API for accessing field values for many docs
    (LUCENE-831 -- new FieldCache API, LUCENE-1231 -- column stride
    fields) shouldn't have this same problem since it's the "top level"
    that's accessing the values (ie, one iterator per field X query).


  • Paul Elschot (JIRA) at Jan 8, 2009 at 11:37 am
    [ https://issues.apache.org/jira/browse/LUCENE-1476?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12661944#action_12661944 ]

    Paul Elschot commented on LUCENE-1476:
    --------------------------------------

    bq. To minimize CPU cycles, it would theoretically make more sense to handle deletions much higher up, at the top level Scorer, Searcher, or even the HitCollector level.

    How about a SegmentSearcher?
  • Marvin Humphrey (JIRA) at Jan 8, 2009 at 2:33 pm
    [ https://issues.apache.org/jira/browse/LUCENE-1476?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12661977#action_12661977 ]

    Marvin Humphrey commented on LUCENE-1476:
    -----------------------------------------

    Paul Elschot:
    bq. How about a SegmentSearcher?
    I like the idea of a SegmentSearcher in general. A little while back, I wondered whether exposing SegmentReaders was really the best way to handle segment-centric search. Upon reflection, I think it is. Segments are a good unit. They're pure inverted indexes (notwithstanding doc stores and tombstones); the larger composite only masquerades as one.

    I noticed that in one version of the patch for segment-centric search (LUCENE-1483), each sorted search involved the creation of sub-searchers, which were then used to compile Scorers. It would make sense to cache those as individual SegmentSearcher objects, no?

    And then, to respond to the original suggestion, the SegmentSearcher level seems like a good place to handle application of a deletions quasi-filter. I think we could avoid having to deal with segment-start offsets that way.
  • Marvin Humphrey (JIRA) at Jan 8, 2009 at 2:49 pm
    [ https://issues.apache.org/jira/browse/LUCENE-1476?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12661982#action_12661982 ]

    Marvin Humphrey commented on LUCENE-1476:
    -----------------------------------------

    How about if we model deletions-as-iterator on BitSet.nextSetBit(int tick) instead of a true iterator that keeps state?

    You can do that now by implementing BitVector.nextSetBit(int tick) and using that in TermDocs to set a nextDeletion member var instead of checking every doc num with BitVector.get().

    That way, the object that provides deletions can still be shared.
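
    A sketch of how that could look in a TermDocs-style loop, with java.util.BitSet standing in for a BitVector that has grown the proposed nextSetBit(); collectNonDeleted() and its arguments are stand-ins for illustration, not the shipped SegmentTermDocs code:

    {code}
    import java.io.IOException;
    import java.util.BitSet;
    import java.util.function.IntConsumer;
    import org.apache.lucene.index.TermDocs;

    class NextSetBitSketch {
        // nextDeletion caches the next deleted doc num, so most docs avoid
        // any deletions check; the shared deletions object stays stateless.
        static void collectNonDeleted(TermDocs termDocs, BitSet deletedDocs,
                                      IntConsumer collect) throws IOException {
            int nextDeletion =
                (deletedDocs == null) ? -1 : deletedDocs.nextSetBit(0);
            if (nextDeletion < 0) nextDeletion = Integer.MAX_VALUE;

            while (termDocs.next()) {
                int doc = termDocs.doc();
                if (doc >= nextDeletion) {
                    nextDeletion = deletedDocs.nextSetBit(doc);
                    boolean deleted = (doc == nextDeletion);
                    if (nextDeletion < 0) nextDeletion = Integer.MAX_VALUE;
                    if (deleted) continue;  // skip the deleted doc
                }
                collect.accept(doc);
            }
        }
    }
    {code}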
  • Marvin Humphrey at Jan 8, 2009 at 10:20 pm

    > You can do that now by implementing BitVector.nextSetBit(int tick) and using
    > that in TermDocs to set a nextDeletion member var instead of checking every
    > doc num with BitVector.get().
    This seems so easy, I should take a crack at it. :)

    Marvin Humphrey


  • Marvin Humphrey (JIRA) at Jan 8, 2009 at 3:19 pm
    [ https://issues.apache.org/jira/browse/LUCENE-1476?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12661995#action_12661995 ]

    Marvin Humphrey commented on LUCENE-1476:
    -----------------------------------------

    Mike McCandless:
    bq. For a TermQuery (one term) the cost of the two approaches should be the same.
    It'll be close, but I don't think that's quite true. TermScorer pre-fetches
    document numbers in batches from the TermDocs object. At present, only
    non-deleted doc nums get cached. If we move the deletions filtering up, then
    we'd increase traffic through that cache. However, filling it would be
    slightly cheaper, because we wouldn't be performing the deletions check.

    In theory. I'm not sure there's a way to streamline away that deletions check
    in TermDocs and maintain backwards compatibility. And while this is a fun
    brainstorm, I'm still far from convinced that having TermDocs.next() and
    Scorer.next() return deleted docs by default is a good idea.
    bq. For AND (and other) queries I'm not sure. In theory, having to process more docIDs is more costly, eg a PhraseQuery or SpanXXXQuery may see much higher net cost.
    If you were applying deletions filtering after Scorer.next(), then it seems
    likely that costs would go up because of extra hit processing. However, if
    you use Scorer.skipTo() to jump past deletions, as in the loop I provided
    above, then PhraseScorer etc. shouldn't incur any more costs themselves.
    bq. a costly per-docID search with a very restrictive filter could be far more efficient if you applied the Filter earlier in the chain.
    Under the skipTo() loop, I think the filter effectively *does* get applied
    earlier in the chain. Does that make sense?

    I think the potential performance downside comes down to prefetching in
    TermScorer, unless there are other classes that do similar prefetching.



  • Mark Miller (JIRA) at Jan 8, 2009 at 3:21 pm
    [ https://issues.apache.org/jira/browse/LUCENE-1476?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12661998#action_12661998 ]

    Mark Miller commented on LUCENE-1476:
    -------------------------------------

    bq. I noticed that in one version of the patch for segment-centric search (LUCENE-1483), each sorted search involved the creation of sub-searchers, which were then used to compile Scorers. It would make sense to cache those as individual SegmentSearcher objects, no?

    That's a fairly old version, I think (based on using MultiSearcher as a hack). Now we are using one queue and running it through each subreader of the MultiReader.
  • Jason Rutherglen (JIRA) at Jan 8, 2009 at 4:47 pm
    [ https://issues.apache.org/jira/browse/LUCENE-1476?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12662033#action_12662033 ]

    Jason Rutherglen commented on LUCENE-1476:
    ------------------------------------------

    Marvin: "The whole tombstone idea arose out of the need for (close to) realtime search! It's intended to improve write speed."

    It does improve the write speed. In making realtime search write speed fast enough, I found that writing individual files per segment can become too costly (they accumulate fast; appending to a single file is faster than creating new files; deleting the files becomes costly). For example, writing small individual files per commit will generate many files if the number of segments is large and the delete spans multiple segments. How much this matters varies with how often updates are expected to occur; I modeled it on the extreme case of the update frequency of a MySQL instance backing data for a web application.

    The MySQL design, translated to Lucene, is a transaction log per index, where the updates, consisting of documents and deletes, are written to the transaction log file. If Lucene crashed for some reason, the transaction log would be replayed. The in-memory indexes and newly deleted document bitvectors would be held in RAM (LUCENE-1314) until flushed, manually or based on memory usage. Many users may not want a transaction log, as they may be storing the updates in a separate SQL database instance (this is the case where I work), making a transaction log redundant, so it should be optional. The first implementation of this will not have a transaction log.

    Marvin: "I don't think I understand. Is this the "combination index reader/writer" model, where the writer prepares a data structure that then gets handed off to the reader?"

    It would be exposed as a combination reader writer that manages the transaction status of each update. The internal architecture is such that after each update, a new reader representing the new documents and deletes for the transaction is generated and put onto a stack. The reader stack is drained based on whether a reader is too old to be useful anymore (i.e. no references to it, or it has N readers ahead of it).
  • Marvin Humphrey (JIRA) at Jan 8, 2009 at 7:13 pm
    [ https://issues.apache.org/jira/browse/LUCENE-1476?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12662065#action_12662065 ]

    Marvin Humphrey commented on LUCENE-1476:
    -----------------------------------------

    Jason Rutherglen:
    I found in making the realtime search write speed fast enough that writing
    to individual files per segment can become too costly (they accumulate fast,
    appending to a single file is faster than creating new files, deleting the
    files becomes costly).
    I saw you mentioning i/o overhead on Windows in particular. I can't see a way
    to mod Lucene so that it doesn't generate a bunch of files for each commit,
    and FWIW Lucy/KS is going to generate even more files than Lucene.

    Half-seriously... how about writing a single-file Directory implementation?
    bq. For example, writing small individual files per commit will generate many files if the number of segments is large and the delete spans multiple segments.
    There would be a maximum of two files per segment to hold the tombstones: one
    to hold the tombstone rows, and one to map segment identifiers to tombstone
    rows. (In Lucy/KS, the mappings would probably be stored in the JSON-encoded
    "segmeta" file, which stores human-readable metadata on behalf of multiple
    components.)

    Segments containing tombstones would be merged according to whatever merge
    policy was in place. So there won't ever be an obscene number of tombstone
    files unless you allow an obscene number of segments to accumulate.
    bq. Many users may not want a transaction log, as they may be storing the updates in a separate SQL database instance (this is the case where I work), making a transaction log redundant, so it should be optional.
    I can see how this would be quite useful at the application level. However, I
    think it might be challenging to generalize the transaction log concept at the
    library level:

    {code}
    CustomAnalyzer analyzer = new CustomAnalyzer();
    IndexWriter indexWriter = new IndexWriter("/path/to/index", analyzer, true);
    indexWriter.addDocument(nextDoc());
    analyzer.setFoo(2); // change of state not recorded by transaction log
    indexWriter.addDocument(nextDoc());
    {code}

    MySQL is more of a closed system than Lucene, which I think makes options
    available that aren't available to us.
    bq. The reader stack is drained based on whether a reader is too old to be useful anymore (i.e. no references to it, or it has N readers ahead of it).
    Right, this is the kind of thing that Lucene has to do because of the
    single-reader model, and that we're trying to get away from in Lucy/KS by
    exploiting mmap and making IndexReaders cheap wrappers around the system i/o
    cache.

    I don't think I can offer any alternative design suggestions that meet your
    needs. There's going to be a change rate that overwhelms the multi-file
    commit system, and it seems that you've determined you're up against it.

    What's killing us is something different: not absolute change rate, but poor
    worst-case performance.

    FWIW, we contemplated a multi-index system with an index on a RAM disk for
    fast changes and a primary index on the main file system. It would have
    worked fine for pure adds, but it was very tricky to manage state for
    documents which were being "updated", i.e. deleted and re-added. How are you
    handling all these small adds with your combo reader/writer? Do you not have
    that problem?
  • Michael McCandless (JIRA) at Jan 8, 2009 at 7:45 pm
    [ https://issues.apache.org/jira/browse/LUCENE-1476?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12662089#action_12662089 ]

    Michael McCandless commented on LUCENE-1476:
    --------------------------------------------

    {quote}
    There's going to be a change rate that overwhelms the multi-file
    commit system, and it seems that you've determined you're up against
    it.
    {quote}

    Well... IndexWriter need not "commit" in order to allow a reader to
    see the files?

    Commit is for crash recovery, and for knowing when it's OK to delete
    prior commits. Simply writing the files (and not syncing them), and
    perhaps giving IndexReader.open the SegmentInfos to use directly (and
    not writing a segments_N via the filesystem) would allow us to search
    added docs without paying the cost of sync'ing all the files.

    Also: brand new, tiny segments should be written into a RAMDirectory
    and then merged over time into the real Directory.
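
    A sketch of the buffering half of that idea using only the public 2.4 API; the two-writer arrangement, BufferedAdds, and the 1 MB flush threshold are illustrative assumptions, not how IndexWriter's internal flushing actually works:

    {code}
    import java.io.IOException;
    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.store.Directory;
    import org.apache.lucene.store.RAMDirectory;

    class BufferedAdds {
        private final Analyzer analyzer;
        private final IndexWriter mainWriter;  // the real, on-disk index
        private RAMDirectory ramDir;
        private IndexWriter ramWriter;

        BufferedAdds(Directory fsDir, Analyzer analyzer) throws IOException {
            this.analyzer = analyzer;
            mainWriter = new IndexWriter(fsDir, analyzer,
                IndexWriter.MaxFieldLength.UNLIMITED);
            newRamSegment();
        }

        private void newRamSegment() throws IOException {
            ramDir = new RAMDirectory();
            ramWriter = new IndexWriter(ramDir, analyzer, true,
                IndexWriter.MaxFieldLength.UNLIMITED);
        }

        void add(Document doc) throws IOException {
            ramWriter.addDocument(doc);            // cheap: RAM only, no sync
            if (ramDir.sizeInBytes() > 1 << 20) {  // invented 1 MB threshold
                ramWriter.close();
                mainWriter.addIndexesNoOptimize(new Directory[] { ramDir });
                newRamSegment();                   // start a fresh RAM buffer
            }
        }
    }
    {code}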

  • Michael McCandless (JIRA) at Jan 8, 2009 at 7:49 pm
    [ https://issues.apache.org/jira/browse/LUCENE-1476?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12662092#action_12662092 ]

    Michael McCandless commented on LUCENE-1476:
    --------------------------------------------

    {quote}
    If Lucene crashed for some reason the transaction log would be replayed.
    {quote}

    I think the transaction log is useful for some applications, but could
    (should) be built as a separate (optional) layer entirely on top of
    Lucene's core. Ie, neither IndexWriter nor IndexReader need to be
    aware of the transaction log, which update belongs to which
    transaction, etc?

  • Michael McCandless (JIRA) at Jan 8, 2009 at 8:01 pm
    [ https://issues.apache.org/jira/browse/LUCENE-1476?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12662097#action_12662097 ]

    Michael McCandless commented on LUCENE-1476:
    --------------------------------------------


    {quote}
    It would be exposed as a combination reader writer that manages the transaction status of each update.
    {quote}

    I think the transactions layer would also sit on top of this
    "realtime" layer? EG this "realtime" layer would expose a commit()
    method, and the transaction layer above it would maintain the
    transaction log, periodically calling commit() and truncating the
    transaction log?

    This "realtime" layer, then, would internally maintain a single
    IndexWriter and the readers. IndexWriter would flush (not commit) new
    segments into a RAMDir and yield its in-RAM SegmentInfos to
    IndexReader.reopen. MergePolicy periodically gets those into the real
    Directory. When reopening a reader we have the freedom to use old
    (already merged away) segments if the newly merged segment isn't warm
    yet.

    We "just" need to open some things up in IndexWriter:

    * IndexReader.reopen with the in-RAM SegmentInfos

    * Willingness to allow an IndexReader to maintain & update deleted
    docs even though IndexWriter has the write lock

    * Access to segments that were already merged away (I think we could
    make a DeletionPolicy that pays attention to when the newly merged
    segment is not yet warmed and keeps the prior segments around).
    I think this'd require allowing DeletionPolicy to see "flush
    points" in addition to commit points (it doesn't today).

    But I'm still hazy on the details on exactly how to open up
    IndexWriter.

  • Marvin Humphrey (JIRA) at Jan 8, 2009 at 8:07 pm
    [ https://issues.apache.org/jira/browse/LUCENE-1476?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12662100#action_12662100 ]

    Marvin Humphrey commented on LUCENE-1476:
    -----------------------------------------

    Mike McCandless:
    bq. I'm also curious what cost you see of doing the merge sort for every search; I think it could be uncomfortably high since it's so hard-for-cpu-to-predict-branch-intensive.
    Probably true. You're going to get accelerating degradation as the number of
    deletions increases. In a large index, you could end up merging 20, 30
    streams. Based on how the priority queue in ORScorer tends to take up space
    in profiling data, that might not be good.

    It'd be manageable if you can keep your index in reasonably good shape, but
    you'll be suckin' pondwater if it gets flabby.
    bq. We could take the first search that doesn't use skipTo and save the result of the merge sort, essentially doing an in-RAM-only "merge" of those deletes, and let subsequent searches use that single merged stream.
    That was what I had in mind when proposing the pseudo-iterator model.

    {code}
    class TombStoneDelEnum extends DelEnum {
        int nextDeletion(int docNum) {
            while (currentMax < docNum) { nextInternal(); }
            return bits.nextSetBit(docNum);
        }
        // ...
    }
    {code}
    bq. (This is not MMAP friendly, though).
    Yeah. Ironically, that use of tombstones is more compatible with the Lucene
    model. :-)

    I'd be reluctant to have Lucy/KS realize those large BitVectors in per-object
    process RAM. That'd spoil the "cheap wrapper around system i/o cache"
    IndexReader plan.

    I can't see an answer yet. But the one thing I do know is that Lucy/KS needs
    a pluggable deletions mechanism to make experimentation easier -- so that's
    what I'm working on today.
  • Michael McCandless (JIRA) at Jan 8, 2009 at 8:11 pm
    [ https://issues.apache.org/jira/browse/LUCENE-1476?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12662101#action_12662101 ]

    Michael McCandless commented on LUCENE-1476:
    --------------------------------------------

    {quote}
    If we move the deletions filtering up, then we'd increase traffic through that cache
    {quote}

    OK, right. So we may have some added cost because of this. I think
    it's only TermScorer that uses the bulk API though.

    {quote}
    If you were applying deletions filtering after Scorer.next(), then it seems
    likely that costs would go up because of extra hit processing. However, if
    you use Scorer.skipTo() to jump past deletions, as in the loop I provided
    above, then PhraseScorer etc. shouldn't incur any more costs themselves.
    {quote}

    Ahhh, now I got it! Good, you're right.

    {quote}
    Under the skipTo() loop, I think the filter effectively does get applied
    earlier in the chain. Does that make sense?
    {quote}

    Right. This is how Lucene works today. Excellent.

    So, net/net it seems like the "deletes-as-a-filter" approach is compelling?

  • Michael McCandless (JIRA) at Jan 8, 2009 at 8:13 pm
    [ https://issues.apache.org/jira/browse/LUCENE-1476?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12662102#action_12662102 ]

    Michael McCandless commented on LUCENE-1476:
    --------------------------------------------

    {quote}
    How about if we model deletions-as-iterator on BitSet.nextSetBit(int tick) instead of a true iterator that keeps state?
    {quote}
    That works if under-the-hood it's a non-sparse representation. But if it's sparse, you need an iterator (state) to remember where you are.
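
    To illustrate: with a dense java.util.BitSet, "next deleted doc >= target" is a stateless call, so one object can serve every consumer; with a sparse sorted array, a stateless answer costs a binary search per call, which is exactly what a per-consumer iterator would amortize. Names here are invented for illustration:

    {code}
    import java.util.Arrays;
    import java.util.BitSet;

    class DenseVsSparse {
        // Dense: stateless and shareable across consumers.
        static int nextDeletedDense(BitSet bits, int target) {
            return bits.nextSetBit(target);  // -1 if none remain
        }

        // Sparse (a sorted int[] of deleted docs): stateless access pays a
        // binary search per call; an iterator would instead keep a cursor.
        static int nextDeletedSparse(int[] sortedDeletes, int target) {
            int idx = Arrays.binarySearch(sortedDeletes, target);
            if (idx < 0) idx = -idx - 1;     // insertion point
            return idx < sortedDeletes.length ? sortedDeletes[idx] : -1;
        }
    }
    {code}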
  • Marvin Humphrey (JIRA) at Jan 8, 2009 at 8:29 pm
    [ https://issues.apache.org/jira/browse/LUCENE-1476?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12662107#action_12662107 ]

    Marvin Humphrey commented on LUCENE-1476:
    -----------------------------------------

    Mike McCandless:
    bq. Commit is for crash recovery, and for knowing when it's OK to delete prior commits. Simply writing the files (and not syncing them), and perhaps giving IndexReader.open the SegmentInfos to use directly (and not writing a segments_N via the filesystem) would allow us to search added docs without paying the cost of sync'ing all the files.
    Mmm. I think I might have given IndexWriter.commit() slightly different
    semantics. Specifically, I might have given it a boolean "sync" argument
    which defaults to false.
    bq. Also: brand new, tiny segments should be written into a RAMDirectory and then merged over time into the real Directory.
    Two comments. First, if you don't sync, but rather leave it up to the OS when
    it actually performs the disk i/o, how expensive is flushing? Can
    we make it cheap enough to meet Jason's absolute change rate requirements?

    Second, the multi-index model is very tricky when dealing with "updates". How
    do you guarantee that you always see the "current" version of a given
    document, and only that version? When do you expose new deletes in the
    RAMDirectory, when do you expose new deletes in the FSDirectory, how do you
    manage slow merges from the RAMDirectory to the FSDirectory, how do you manage
    new adds to the RAMDirectory that take place during slow merges...

    Building a single-index, two-writer model that could handle fast updates while
    performing background merging was one of the main drivers behind the tombstone
    design.

  • Robert engels at Jan 8, 2009 at 10:57 pm
The way we've simplified this is that every document has an OID. It
simplifies updates and delete tracking (in the transaction log).
  • Marvin Humphrey (JIRA) at Jan 8, 2009 at 8:35 pm
    [ https://issues.apache.org/jira/browse/LUCENE-1476?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12662110#action_12662110 ]

    Marvin Humphrey commented on LUCENE-1476:
    -----------------------------------------

Mike McCandless:
{quote}
if it's sparse, you need an iterator (state) to remember where you are.
{quote}
We can hide the sparse representation and the internal state, having the
object lazily build a non-sparse representation. That's what I had in
mind with the code for TombstoneDelEnum.nextDeletion().
TombstoneDelEnum.nextInternal() would be a private method used for building up
the internal BitVector.
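For illustration, the shape described above might look roughly like this (the method names come from the description; the bodies and the SparseSource stand-in are guesses):

{code}
// Sketch only: a shared quasi-iterator that lazily un-sparsifies.
// SparseSource stands in for the merged tombstone streams (an assumption).
class TombstoneDelEnum {
  interface SparseSource { int next(); } // next deleted doc, or -1 when done

  private final java.util.BitSet bits = new java.util.BitSet(); // built lazily
  private final SparseSource source;

  TombstoneDelEnum(SparseSource source) { this.source = source; }

  /** BitSet.nextSetBit-style call: no per-caller state required. */
  int nextDeletion(int docNum) {
    int found = bits.nextSetBit(docNum);
    while (found == -1) {
      int next = nextInternal();
      if (next == -1) return -1;        // sparse source exhausted
      bits.set(next);                   // grow the non-sparse representation
      if (next >= docNum) found = next; // source is sorted, so first hit wins
    }
    return found;
  }

  private int nextInternal() { return source.next(); }
}
{code}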
  • Marvin Humphrey (JIRA) at Jan 8, 2009 at 10:13 pm
    [ https://issues.apache.org/jira/browse/LUCENE-1476?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12662143#action_12662143 ]

    Marvin Humphrey commented on LUCENE-1476:
    -----------------------------------------

Mike McCandless:
{quote}
So, net/net it seems like "deletes-as-a-filter" approach is compelling?
{quote}
In terms of CPU-cycles, maybe.

    My gut tells me that it's all but mandatory if we use merged-on-the-fly
    tombstone streams, but if Lucene goes that route it should cache a BitVector
    and use a shared pseudo-iterator -- in which case the costs will no longer be
    significantly more than the current system.

    Under the current system, I'm not certain that the deletions checks are that
    excessive. Consider this loop from TermDocs.read():

{code}
while (i < length && count < df) {
  // manually inlined call to next() for speed
  final int docCode = freqStream.readVInt();
  doc += docCode >>> 1;              // shift off low bit
  if ((docCode & 1) != 0)            // if low bit is set
    freq = 1;                        //   freq is one
  else
    freq = freqStream.readVInt();    // else read freq
  count++;

  if (deletedDocs == null || !deletedDocs.get(doc)) {
    docs[i] = doc;
    freqs[i] = freq;
    ++i;
  }
}
{code}

    The CPU probably does a good job of predicting the result of the null check on
    deletedDocs. The readVInt() method call is already a pipeline killer.

    Here's how that loop looks after I patch the deletions check for pseudo-iteration.

{code}
while (i < length && count < df) {
  // manually inlined call to next() for speed
  final int docCode = freqStream.readVInt();
  doc += docCode >>> 1;              // shift off low bit
  if ((docCode & 1) != 0)            // if low bit is set
    freq = 1;                        //   freq is one
  else
    freq = freqStream.readVInt();    // else read freq
  count++;

  if (doc >= nextDeletion) {
    if (doc > nextDeletion) {
      nextDeletion = deletedDocs.nextSetBit(doc);
    }
    if (doc == nextDeletion) {
      continue;
    }
  }

  docs[i] = doc;
  freqs[i] = freq;
  ++i;
}
return i;
{code}

    Again, the CPU is probably going to do a pretty good job of predicting the
    results of the deletion check. And even then, we're accessing the same shared
    BitVector across all TermDocs, and its bits are hopefully a cache hit.

    To really tighten this loop, you have to do what Nate and I want with Lucy/KS:

    * Remove all function/method call overhead.
    * Operate directly on the memory mapped postings file.

{code}
u32_t
SegPList_bulk_read(SegPostingList *self, i32_t *doc_nums, i32_t *freqs,
                   u32_t request)
{
    i32_t       doc_num   = self->doc_num;
    const u32_t remaining = self->doc_freq - self->count;
    const u32_t num_got   = request < remaining ? request : remaining;
    InStream   *instream  = self->instream; /* assumed member, absent from the original excerpt */
    char       *buf       = InStream_Buf(instream, C32_MAX_BYTES * num_got);
    u32_t       i;

    for (i = 0; i < num_got; i++) {
        u32_t doc_code = Math_decode_c32(&buf); /* static inline function */
        u32_t freq     = (doc_code & 1) ? 1 : Math_decode_c32(&buf);
        doc_num    += doc_code >> 1;
        doc_nums[i] = doc_num;
        freqs[i]    = freq;
        /* note: the original excerpt also had a stray ++i here, which would
         * have skipped every other slot; the for-loop increment suffices */
    }

    InStream_Advance_Buf(instream, buf);
    self->doc_num = doc_num;
    self->count  += num_got;

    return num_got;
}
{code}

    (That loop would be even better using PFOR instead of vbyte.)

    In terms of public API, I don't think it's reasonable to change Lucene's
    Scorer and TermDocs classes so that their iterators start returning deleted
    docs.

    We could potentially make that choice with Lucy/KS, thus allowing us to remove
    the deletions check in the PostingList iterator (as above) and getting a
    potential speedup. But even then I hesitate to push the deletions API upwards
    into a space where users of raw Scorer and TermDocs classes have to deal with
    it -- especially since iterator-style deletions aren't very user-friendly.

  • Jason Rutherglen (JIRA) at Jan 9, 2009 at 1:41 am
    [ https://issues.apache.org/jira/browse/LUCENE-1476?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12662214#action_12662214 ]

    Jason Rutherglen commented on LUCENE-1476:
    ------------------------------------------

    M.M.:" I think the transactions layer would also sit on top of this
    "realtime" layer? EG this "realtime" layer would expose a commit()
    method, and the transaction layer above it would maintain the
    transaction log, periodically calling commit() and truncating the
    transaction log?"

One approach that may be optimal is to expose from IndexWriter a createTransaction method that accepts new documents and deletes. All documents have an associated UID. The new documents could feasibly be encoded into a single segment that represents the added documents for that transaction. The deletes would be represented as long UIDs rather than int doc ids. The commit method would then be called on the transaction object, which returns a reader representing the latest version of the index plus the changes created by the transaction. This system would be a part of IndexWriter and would not rely on a transaction log. IndexWriter.commit would flush the in-RAM realtime indexes to disk. The realtime merge policy would flush based on RAM usage or number of docs.

{code}
IndexWriter iw = new IndexWriter();
Transaction tr = iw.createTransaction();
tr.addDocument(new Document());
tr.addDocument(new Document());
tr.deleteDocument(1200L);                  // delete by long UID, not int doc id
IndexReader ir = tr.flush();               // flushes transaction to the index (probably to a ram index)
IndexReader latestReader = iw.getReader(); // same as ir
iw.commit(doWait);                         // boolean doWait; commits the in ram realtime index to disk
{code}

When commit is called, the disk segment readers flush their deletes to disk, which is fast. The in-RAM realtime index is merged to disk. The process is described in more detail further down.

    M.H.: "how about writing a single-file Directory implementation?"

I'm not sure we need this, because an appending rolling transaction log should work. Segments don't change; only things like norms and deletes do, and those can be appended to a rolling transaction log file system. If we had a generic transaction logging system, the future column stride fields, deletes, norms, and future realtime features could use it and be realtime.
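A sketch of appending one record to such a generic log (the record layout here, txn id plus type plus length-prefixed payload, is invented for illustration, not from any patch):

{code}
import java.io.DataOutputStream;
import java.io.IOException;

// Sketch: one record in a generic rolling transaction log.
static void appendRecord(DataOutputStream log, long txnId, byte type,
                         byte[] payload) throws IOException {
  log.writeLong(txnId);         // monotonically increasing transaction id
  log.writeByte(type);          // e.g. 0 = delete, 1 = norm update, ...
  log.writeInt(payload.length); // length prefix so replay can skip records
  log.write(payload);
  log.flush();                  // fsync policy is left to the caller
}
{code}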

    M.H.: "How do you guarantee that you always see the "current" version of a given document, and only that version?

Each transaction returns an IndexReader. Each "row" or "object" could use a unique id in the transaction log model, which would allow documents that were merged into other segments to be deleted during a transaction log replay.

    M.H.: "When do you expose new deletes in the RAMDirectory, when do you expose new deletes in the FSDirectory"

    When do you expose new deletes in the RAMDir, when do you expose new deletes in the FSDirectory, how do you manage slow merges from the RAMDir to the FSDirectory, how do you manage new adds to the RAMDir that take place during slow merges..."

Queue deletes to the RAMDir while copying the RAMDir to the FSDir in the background, perform the deletes after the copy is completed, then instantiate a new reader with the newly merged FSDirectory and new RAMDirs. Writes that were occurring during this process would be happening to another new RAMDir.
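Spelled out as a sketch (every type here is hypothetical, a stand-in for whatever machinery IndexWriter would grow):

{code}
// All names are hypothetical; this just restates the sequence above.
interface RamBuffer {
  RamBuffer freezeAndStartNew();           // new writes go to a fresh RAMDir
  java.util.List<Integer> queuedDeletes(); // deletes queued during the copy
}
interface DiskSegments {
  void copyFrom(RamBuffer frozen) throws java.io.IOException; // slow background copy
  void delete(int doc);
  org.apache.lucene.index.IndexReader openReader(RamBuffer live)
      throws java.io.IOException;
}

static org.apache.lucene.index.IndexReader checkpoint(RamBuffer ram,
    DiskSegments disk) throws java.io.IOException {
  RamBuffer live = ram.freezeAndStartNew(); // writes continue elsewhere
  disk.copyFrom(ram);                       // background copy RAMDir -> FSDir
  for (int doc : ram.queuedDeletes()) {
    disk.delete(doc);                       // apply deletes after the copy
  }
  return disk.openReader(live);             // merged FSDir + the fresh RAMDir
}
{code}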

    One way to think of the realtime problem is in terms of segments rather than FSDirs and RAMDirs. Some segments are on disk, some in RAM. Each transaction is an instance of some segments and their deletes (and we're not worried about the deletes being flushed or not so assume they exist as BitVectors). The system should expose an API to checkpoint/flush at a given transaction level (usually the current) and should not stop new updates from happening.

When I wrote this type of system, I managed individual segments outside of IndexWriter's merge policy and performed the merging manually by placing each segment in its own FSDirectory (the segment size was 64MB), which minimized the number of directories. I do not know the best approach for this when performed within IndexWriter.

    M.H.: "Two comments. First, if you don't sync, but rather leave it up to the OS when
    it wants to actually perform the actual disk i/o, how expensive is flushing? Can
    we make it cheap enough to meet Jason's absolute change rate requirements?"

When I tried out the transaction log, a write usually mapped pretty quickly to a hard disk write. I don't think it's safe to leave writes up to the OS.

    M.M.: "maintain & updated deleted docs even though IndexWriter has the write lock"

In my previous realtime search implementation I got around this by having each segment in its own directory. Assuming this is non-optimal, we will need to expose an IndexReader that has the write lock of the IndexWriter.

  • Marvin Humphrey (JIRA) at Jan 9, 2009 at 3:37 am
    [ https://issues.apache.org/jira/browse/LUCENE-1476?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

    Marvin Humphrey updated LUCENE-1476:
    ------------------------------------

    Attachment: quasi_iterator_deletions.diff

    Here's a patch implementing BitVector.nextSetBit() and converting
    SegmentTermDocs over to use the quasi-iterator style. Tested but
    not benchmarked.
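For readers following along without the attachment, a nextSetBit() over BitVector's packed byte[] might look roughly like this (a sketch assuming the existing bits and size fields, not the attached code):

{code}
// Sketch, not the attached patch: scan forward through the packed byte[],
// assuming LSB-first packing (the bit for doc d lives at bits[d >> 3], 1 << (d & 7)).
public int nextSetBit(int fromIndex) {
  if (fromIndex < 0 || fromIndex >= size) return -1;
  int byteIdx = fromIndex >> 3;
  // mask off bits below fromIndex within the first byte
  int bitsInByte = (bits[byteIdx] & 0xFF) & (0xFF << (fromIndex & 7));
  while (true) {
    if (bitsInByte != 0) {
      int doc = (byteIdx << 3) + Integer.numberOfTrailingZeros(bitsInByte);
      return doc < size ? doc : -1;
    }
    if (++byteIdx >= bits.length) return -1;
    bitsInByte = bits[byteIdx] & 0xFF;
  }
}
{code}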
  • Doug Cutting (JIRA) at Jan 9, 2009 at 5:33 am
    [ https://issues.apache.org/jira/browse/LUCENE-1476?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12662247#action_12662247 ]

    Doug Cutting commented on LUCENE-1476:
    --------------------------------------

    bq. To really tighten this loop, you have to [ ... ] remove all function/method call overhead [and] operate directly on the memory mapped postings file.

    That sounds familiar...

    http://svn.apache.org/viewvc/lucene/java/trunk/src/gcj/org/apache/lucene/index/GCJTermDocs.cc?view=markup

  • Michael McCandless (JIRA) at Jan 9, 2009 at 12:00 pm
    [ https://issues.apache.org/jira/browse/LUCENE-1476?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12662339#action_12662339 ]

    Michael McCandless commented on LUCENE-1476:
    --------------------------------------------

    {quote}
    Mmm. I think I might have given IndexWriter.commit() slightly different
    semantics. Specifically, I might have given it a boolean "sync" argument
    which defaults to false.
    {quote}

    It seems like there are various levels of increasing "durability" here:

    * Make available to a reader in the same JVM (eg flush new segment
    to a RAMDir) -- not exposed today.

    * Make available to a reader sharing the filesystem right now (flush
    to Directory in "real" filesystem, but don't sync) -- exposed
    today (but deprecated as a public API) as flush.

    * Make available to readers even if OS/machine crashes (flush to
    Directory in "real" filesystem, and sync) -- exposed today as commit.

    {quote}
    Two comments. First, if you don't sync, but rather leave it up to the OS when
    it wants to actually perform the actual disk i/o, how expensive is flushing? Can
    we make it cheap enough to meet Jason's absolute change rate requirements?
    {quote}

Right, I've been wondering the same thing. I think this should be our
first approach to realtime indexing, and then we swap in a RAMDir if
performance is not good enough.

{quote}
Second, the multi-index model is very tricky when dealing with "updates". How
do you guarantee that you always see the "current" version of a given
document, and only that version? When do you expose new deletes in the
RAMDirectory, when do you expose new deletes in the FSDirectory, how do you
manage slow merges from the RAMDirectory to the FSDirectory, how do you manage
new adds to the RAMDirectory that take place during slow merges...
{quote}

    I'm not proposing multi-index model (at least I think I'm not!). A
    single IW could flush new tiny segments into a RAMDir and later merge
    them into a real Dir. But I agree: let's start w/ a single Dir and
    move to RAMDir only if necessary.

    {quote}
    Building a single-index, two-writer model that could handle fast updates while
    performing background merging was one of the main drivers behind the tombstone
    design.
    {quote}

I think carrying the deletions in RAM (reopening the reader) is
probably fastest for Lucene. Lucene, with the "reopened stream of
readers" approach, can do this, but Lucy/KS (with mmap) must use the
filesystem as the intermediary.

  • Michael McCandless (JIRA) at Jan 9, 2009 at 12:02 pm
    [ https://issues.apache.org/jira/browse/LUCENE-1476?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12662342#action_12662342 ]

    Michael McCandless commented on LUCENE-1476:
    --------------------------------------------


    {quote}
We can hide the sparse representation and the internal state, having the
object lazily build a non-sparse representation. That's what I had in
mind with the code for TombstoneDelEnum.nextDeletion().
TombstoneDelEnum.nextInternal() would be a private method used for building up
the internal BitVector.
    {quote}

Got it, though for a low deletion rate presumably you'd want to store
the int docIDs directly so that iterating through them doesn't require an
O(N) scan for the next set bit.

    I think what you'd want to lazily do is merge the N tombstone streams
    for this one segment into a single data structure; whether that data
    structure is sparse or unsparse is a separate decision.
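That lazy merge could be as simple as the following sketch (assuming each tombstone stream yields sorted doc ids; the sparse int[] output is just one of the two choices):

{code}
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

// Sketch: merge N sorted tombstone streams into one sorted, de-duped list.
static int[] mergeTombstones(final List<int[]> streams) {
  // each queue entry is a cursor: {streamIndex, positionInStream}
  PriorityQueue<int[]> pq = new PriorityQueue<>(
      Comparator.comparingInt((int[] cur) -> streams.get(cur[0])[cur[1]]));
  for (int s = 0; s < streams.size(); s++) {
    if (streams.get(s).length > 0) pq.add(new int[] {s, 0});
  }
  ArrayList<Integer> merged = new ArrayList<>();
  int last = -1;
  while (!pq.isEmpty()) {
    int[] cur = pq.poll();
    int doc = streams.get(cur[0])[cur[1]];
    if (doc != last) { merged.add(doc); last = doc; } // de-dupe across streams
    if (++cur[1] < streams.get(cur[0]).length) pq.add(cur);
  }
  int[] out = new int[merged.size()];
  for (int i = 0; i < out.length; i++) out[i] = merged.get(i);
  return out;
}
{code}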

  • Michael McCandless (JIRA) at Jan 9, 2009 at 12:06 pm
    [ https://issues.apache.org/jira/browse/LUCENE-1476?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12662345#action_12662345 ]

    Michael McCandless commented on LUCENE-1476:
    --------------------------------------------

    {quote}
    We could potentially make that choice with Lucy/KS, thus allowing us to remove
    the deletions check in the PostingList iterator (as above) and getting a
    potential speedup. But even then I hesitate to push the deletions API upwards
    into a space where users of raw Scorer and TermDocs classes have to deal with
    it - especially since iterator-style deletions aren't very user-friendly.
    {quote}

    Can't you just put sugar on top? Ie, add an API that matches today's
    API and efficiently applies the "deleted docs filter" for you? Then
    you have 100% back compat?
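Presumably, yes; the sugar could be as thin as this sketch (DocEnum is a hypothetical stand-in for a deletions-unaware iterator):

{code}
// Sketch of the "sugar": wrap a deletions-unaware iterator so existing
// callers still never see deleted docs.
class FilteredDocEnum {
  interface DocEnum { int next(); } // next doc id, or -1 when exhausted

  private final DocEnum raw;           // deletions-unaware iterator
  private final BitVector deletedDocs; // may be null when nothing is deleted

  FilteredDocEnum(DocEnum raw, BitVector deletedDocs) {
    this.raw = raw;
    this.deletedDocs = deletedDocs;
  }

  int next() {
    int doc;
    while ((doc = raw.next()) != -1) {
      if (deletedDocs == null || !deletedDocs.get(doc)) return doc;
    }
    return -1;
  }
}
{code}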

  • Michael McCandless (JIRA) at Jan 9, 2009 at 12:10 pm
    [ https://issues.apache.org/jira/browse/LUCENE-1476?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12662347#action_12662347 ]

    Michael McCandless commented on LUCENE-1476:
    --------------------------------------------


    {quote}
    Under the current system, I'm not certain that the deletions checks are that
    excessive.
    {quote}

I made a simple test that up-front converted the deleted docs
BitVector into an int[] containing the deleted docIDs, and then did
the same nextDeletedDoc change to SegmentTermDocs.

    This was only faster if < 10% of docs were deleted, though I didn't do
    very thorough testing.

I think the problem is that with this change the CPU must predict a new
branch (docNum >= nextDeletion) vs doing a no-branching lookup of the
bit.

  • Michael McCandless (JIRA) at Jan 9, 2009 at 1:30 pm
    [ https://issues.apache.org/jira/browse/LUCENE-1476?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12662362#action_12662362 ]

    Michael McCandless commented on LUCENE-1476:
    --------------------------------------------


    {quote}
    One way to think of the realtime problem is in terms of segments rather than FSDirs and RAMDirs. Some segments are on disk, some in RAM.
    {quote}

I think that's a good approach. IndexWriter couldn't care less which dir
each segment resides in.

    For starters let's flush all new segments into a single Directory (and
    swap in RAMDir, single-file-Dir, etc., if/when necessary for better
    performance).

    Thinking out loud.... what if we added IndexWriter.getReader(), which
    returns an IndexReader for the current index?

    It would read segments that are not "visible" to a newly opened
    IndexReader on that directory (ie, the SegmentInfos inside IndexWriter
    was used to do the open, not the latest segments_N in the Directory).

    That IndexReader is free to do deletions / set norms (shares write
    lock under the hood w/ IndexWriter), and when reopen is called it
    grabs the current SegmentInfos from IndexWriter and re-opens that.

We would un-deprecate flush(). Then, the transaction layer would make
changes, call flush(), call reopen(), and return the reopened reader.
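In code, the transaction layer's flow might look like this (getReader() is only proposed above; flush(), reopen(), and IndexReader.deleteDocument() exist today; the usual org.apache.lucene.index and org.apache.lucene.document imports are assumed):

{code}
// Sketch of the proposed flow; assume `reader` came from the proposed
// IndexWriter.getReader().
IndexReader commitTransaction(IndexWriter writer, IndexReader reader,
                              Document doc, int deleteDocId) throws IOException {
  writer.addDocument(doc);            // new segment, not yet in segments_N
  reader.deleteDocument(deleteDocId); // shares the writer's write lock
  writer.flush();                     // make the new segment visible, no sync
  return reader.reopen();             // re-opens against IndexWriter's SegmentInfos
}
{code}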

  • Jason Rutherglen (JIRA) at Jan 14, 2009 at 5:52 pm
    [ https://issues.apache.org/jira/browse/LUCENE-1476?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12663815#action_12663815 ]

    Jason Rutherglen commented on LUCENE-1476:
    ------------------------------------------

To simplify the patch, can we agree to add a method,
IndexReader.getDeletedDocs, that returns a DocIdSet? This way
applications may have access to deleted docs and not be encumbered by
the IR.isDeleted method.
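Consumption might then look like this sketch (getDeletedDocs() is the proposed method; the iterator follows the 2.4-era next()/doc() contract; org.apache.lucene.search imports assumed):

{code}
// Sketch: how an application might consume the proposed accessor.
void dumpDeletedDocs(IndexReader reader) throws IOException {
  DocIdSet deleted = reader.getDeletedDocs(); // proposed API, not yet in trunk
  if (deleted == null) return;                // no deletions in this reader
  DocIdSetIterator it = deleted.iterator();
  while (it.next()) {                         // 2.4-era contract: next()/doc()
    System.out.println("deleted doc: " + it.doc());
  }
}
{code}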
  • Yonik Seeley (JIRA) at Jan 14, 2009 at 6:38 pm
    [ https://issues.apache.org/jira/browse/LUCENE-1476?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12663837#action_12663837 ]

    Yonik Seeley commented on LUCENE-1476:
    --------------------------------------

    bq. IndexReader.getDeletedDocs that returns a DocIdSet

    Seems like a good idea, but the difficulty lies in defining the semantics of such a call.
    Are subsequent deletes reflected in the returned DocIdSet? Perhaps make this undefined - may or may not be.
    Can the DocIdSet be used concurrently with IndexReader.deleteDocument()? This could work (w/o extra undesirable synchronization) if we're careful.
  • Jason Rutherglen (JIRA) at Jan 14, 2009 at 10:03 pm
    [ https://issues.apache.org/jira/browse/LUCENE-1476?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12663903#action_12663903 ]

    Jason Rutherglen commented on LUCENE-1476:
    ------------------------------------------

I think it should be up to the user. If the user concurrently
modifies, then they're responsible for the possibly spurious effects.
However, if we want to be protective, a philosophy I don't think works
well in LUCENE-1516 either, we can offer IR.getDeletedDocs only from a
read-only index reader. This solves the issues brought up, such as
"undesirable synchronization".
  • Michael McCandless (JIRA) at Jan 15, 2009 at 10:44 am
    [ https://issues.apache.org/jira/browse/LUCENE-1476?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12664083#action_12664083 ]

    Michael McCandless commented on LUCENE-1476:
    --------------------------------------------

    I think it's fine to expose this API, mark it expert & subject to
    change, and state that it simply returns the current deleted docs
    BitVector and any synchronization issues must be handled by the app.
    Probably it should be package private or protected.

    Also, I think Multi*Reader would not implement this API? Meaning
    you must get the individual SegmentReader and ask it.

    Since we are discussing possible changes to deleted docs, eg switching
    to iterator-only accesss, maybe applying deletions as a filter, maybe
    using "tombstones" under-the-hood, etc., this API could very well
    change.

    For our current thinking on realtime search, I think one weak part of
    the design is the blocking copy-on-write required for the first
    deletion on a newly cloned reader. The time taken is in proportion to
    the size of the segment. Switching to a design that can background
    this copy-on-write will necessarily impact how we represent
    and access deletions.

  • Jason Rutherglen (JIRA) at Jan 16, 2009 at 5:07 pm
    [ https://issues.apache.org/jira/browse/LUCENE-1476?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12664566#action_12664566 ]

    Jason Rutherglen commented on LUCENE-1476:
    ------------------------------------------

    {quote}
    it's fine to expose this API, mark it expert & subject to
    change, and state that it simply returns the current deleted docs
    BitVector and any synchronization issues must be handled by the app.
    Probably it should be package private or protected.
    {quote}

    +1 Sounds good

    {quote}
    Also, I think Multi*Reader would not implement this API? Meaning
    you must get the individual SegmentReader and ask it.
    {quote}

    Why? The returned iterator can traverse the multiple bitvectors.

    {quote}
    I think one weak part of
    the design is the blocking copy-on-write required for the first
    deletion on a newly cloned reader. The time taken is in proportion to
    the size of the segment.
    {quote}

    If the segment is large, tombstones can solve this.
  • Michael McCandless (JIRA) at Jan 16, 2009 at 5:21 pm
    [ https://issues.apache.org/jira/browse/LUCENE-1476?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12664570#action_12664570 ]

    Michael McCandless commented on LUCENE-1476:
    --------------------------------------------

    bq. Why? The returned iterator can traverse the multiple bitvectors.

    Woops, sorry: I missed that it would return a DocIdSet (iterator only) vs underlying (current) BitVector. So then MultiReader could return a DocIdSet.

    bq. If the segment is large, tombstones can solve this.

    Right; I was saying, as things are today (single BitVector holds all deleted docs), one limitation of the realtime approach we are moving towards is the copy-on-write cost of the first delete on a freshly cloned reader for a large segment.

    If we moved to using only iterator API for accessing deleted docs within Lucene then we could explore fixes for the copy-on-write cost w/o changing on-disk representation of deletes. IE tombstones are perhaps overkill for Lucene, since we're not using the filesystem as the intermediary for communicating deletes to a reopened reader. We only need an in-RAM incremental solution.
  • Jason Rutherglen (JIRA) at Jan 16, 2009 at 5:29 pm
    [ https://issues.apache.org/jira/browse/LUCENE-1476?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12664573#action_12664573 ]

    Jason Rutherglen commented on LUCENE-1476:
    ------------------------------------------

    {quote}
    If we moved to using only iterator API for accessing deleted docs within Lucene then we could explore fixes for the copy-on-write cost w/o changing on-disk representation of deletes. IE tombstones are perhaps overkill for Lucene, since we're not using the filesystem as the intermediary for communicating deletes to a reopened reader. We only need an in-RAM incremental solution.
    {quote}

    +1 Agreed. Good point about not needing to change the on disk representation as that would make implementation a bit more complicated. Sounds like we need a tombstones patch as well that plays well with IndexReader.clone.

Exposing deleted docs as a DocIdSet allows possible future implementations where TermDocs DOES return deleted docs, as discussed (via a flag to IndexReader). The deleted docs DocIdSet can then be used at a higher level as a filter/query.
  • Jason Rutherglen (JIRA) at Jan 16, 2009 at 5:31 pm
    [ https://issues.apache.org/jira/browse/LUCENE-1476?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

    Jason Rutherglen updated LUCENE-1476:
    -------------------------------------

    Description: Update BitVector to implement DocIdSet. Expose deleted docs DocIdSet from IndexReader. (was: BitVector can implement DocIdSet. This is for making SegmentReader.deletedDocs pluggable.)
    Summary: BitVector implement DocIdSet, IndexReader returns DocIdSet deleted docs (was: BitVector implement DocIdSet)
  • Jason Rutherglen (JIRA) at Jan 16, 2009 at 11:24 pm
    [ https://issues.apache.org/jira/browse/LUCENE-1476?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12664732#action_12664732 ]

    Jason Rutherglen commented on LUCENE-1476:
    ------------------------------------------

    Do we want to implement SegmentTermDocs using DocIdSet in another patch?
  • Jason Rutherglen (JIRA) at Jan 21, 2009 at 1:42 am
    [ https://issues.apache.org/jira/browse/LUCENE-1476?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

    Jason Rutherglen updated LUCENE-1476:
    -------------------------------------

    Attachment: LUCENE-1476.patch

    First cut. All tests pass.

    * Implemented IndexReader.getDeletedDocs in all readers

* Created MultiDocIdSet that iterates over multiple DocIdSets, which is
used by MultiSegmentReader and MultiReader (see the sketch after this list)

    * Incorporated Marvin's patch into SegmentTermDocs and BitVector.
    SegmentTermDocs iterates using deleted docs DocIdSet instead of
    calling BitVector.get.
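The chaining in that MultiDocIdSet is presumably along these lines (a guess at the shape, not the patch itself; per-segment iterators are offset by their segment's doc id base, with DocIdSetIterator and IOException imports assumed):

{code}
// Sketch: iterate deleted docs across segments by offsetting each
// per-segment iterator with that segment's doc id base.
class MultiDocIdSetIteratorSketch {
  private final DocIdSetIterator[] subs; // one per segment, null if no deletes
  private final int[] starts;            // doc id base per segment
  private int seg = 0;

  MultiDocIdSetIteratorSketch(DocIdSetIterator[] subs, int[] starts) {
    this.subs = subs;
    this.starts = starts;
  }

  /** Next deleted doc in the composite doc id space, or -1 when exhausted. */
  int next() throws IOException {
    while (seg < subs.length) {
      if (subs[seg] != null && subs[seg].next()) {
        return starts[seg] + subs[seg].doc();
      }
      seg++; // segment exhausted (or had no deletions)
    }
    return -1;
  }
}
{code}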
  • Marvin Humphrey (JIRA) at Jan 21, 2009 at 2:30 am
    [ https://issues.apache.org/jira/browse/LUCENE-1476?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

    Marvin Humphrey updated LUCENE-1476:
    ------------------------------------

    Attachment: quasi_iterator_deletions_r2.diff

Jason,
{quote}
Incorporated Marvin's patch into SegmentTermDocs and BitVector.
{quote}
I believe there's an inefficiency in my original patch. Code like this occurs in three places:

{code}
+ if (doc >= nextDeletion) {
+   if (doc > nextDeletion)
+     nextDeletion = deletedDocs.nextSetBit(doc);
+   if (doc == nextDeletion)
+     continue;
  }
{code}

While technically correct, this does extra work. When nextSetBit() can't
find any more set bits, it returns -1 (just like java.util.BitSet.nextSetBit() does),
so nextDeletion goes negative and the deletion branch is entered, and
nextSetBit() re-invoked, on every subsequent iteration.

    The solution is to test for -1, as in the new version of the patch (also
    tested but not benchmarked):

{code}
+ if (doc >= nextDeletion) {
+   if (doc > nextDeletion) {
+     nextDeletion = deletedDocs.nextSetBit(doc);
+     if (nextDeletion == -1) {
+       nextDeletion = Integer.MAX_VALUE;
+     }
+   }
+   if (doc == nextDeletion) {
+     continue;
+   }
  }
{code}

Theoretically, we could also change the behavior of nextSetBit() so that it
returns Integer.MAX_VALUE when it can't find any more set bits. However,
that's a little misleading (since Integer.MAX_VALUE is a positive number and
could thus represent a true set bit), and it would also break
BitVector.nextSetBit()'s intended mimicry of java.util.BitSet.nextSetBit().
  • Marvin Humphrey (JIRA) at Jan 21, 2009 at 4:24 am
    [ https://issues.apache.org/jira/browse/LUCENE-1476?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12665697#action_12665697 ]

    Marvin Humphrey commented on LUCENE-1476:
    -----------------------------------------

Mike McCandless:
{quote}
If we move the deletions filtering up, then we'd increase traffic through
that cache
{quote}
OK, right. So we may have some added cost because of this. I think it's only
TermScorer that uses the bulk API though.

The original BooleanScorer also pre-fetches. (That doesn't affect KS because
ORScorer, ANDScorer, NOTScorer and RequiredOptionScorer (which have
collectively replaced BooleanScorer) all proceed doc-at-a-time and implement
skipping.)

And I'm still not certain it's a good idea from an API standpoint: it's
strange to have the PostingList and Scorer iterators include deleted docs.

    Nevertheless, I've now changed over KS to use the deletions-as-filter approach
    in svn trunk. The tie-breaker was related to the ongoing modularization of
    IndexReader: if PostingList doesn't have to handle deletions, then
    PostingsReader doesn't have to know about DeletionsReader, and if
    PostingsReader doesn't have to know about DeletionsReader, then all
    IndexReader sub-components can be implemented independently.

    The code to implement the deletion skipping turned out to be more verbose and
    fiddly than anticipated. It's easy to make fencepost errors when dealing with
    advancing two iterators in sync, especially when you can only invoke the
    skipping iterator method once for a given doc num.

{code}
void
Scorer_collect(Scorer *self, HitCollector *collector, DelEnum *deletions,
               i32_t doc_base)
{
    i32_t doc_num       = 0;
    i32_t next_deletion = deletions ? 0 : I32_MAX;

    /* Execute scoring loop. */
    while (1) {
        if (doc_num > next_deletion) {
            next_deletion = DelEnum_Advance(deletions, doc_num);
            if (next_deletion == 0) { next_deletion = I32_MAX; }
            continue;
        }
        else if (doc_num == next_deletion) {
            /* Skip past deletions. */
            while (doc_num == next_deletion) {
                /* Artificially advance scorer. */
                while (doc_num == next_deletion) {
                    doc_num++;
                    next_deletion = DelEnum_Advance(deletions, doc_num);
                    if (next_deletion == 0) { next_deletion = I32_MAX; }
                }
                /* Verify that the artificial advance actually worked. */
                doc_num = Scorer_Advance(self, doc_num);
                if (doc_num > next_deletion) {
                    next_deletion = DelEnum_Advance(deletions, doc_num);
                }
            }
        }
        else {
            doc_num = Scorer_Advance(self, doc_num + 1);
            if (doc_num >= next_deletion) {
                next_deletion = DelEnum_Advance(deletions, doc_num);
                if (doc_num == next_deletion) { continue; }
            }
        }

        if (doc_num) {
            HC_Collect(collector, doc_num + doc_base, Scorer_Tally(self));
        }
        else {
            break;
        }
    }
}
{code}


  • Paul Elschot (JIRA) at Jan 21, 2009 at 7:37 am
    [ https://issues.apache.org/jira/browse/LUCENE-1476?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12665748#action_12665748 ]

    Paul Elschot commented on LUCENE-1476:
    --------------------------------------
{quote}
If we move the deletions filtering up, then we'd increase traffic through
that cache

OK, right. So we may have some added cost because of this. I think it's only
TermScorer that uses the bulk API though.

The original BooleanScorer also pre-fetches. (That doesn't affect KS because
ORScorer, ANDScorer, NOTScorer and RequiredOptionScorer (which have
collectively replaced BooleanScorer) all proceed doc-at-a-time and implement
skipping.)
{quote}
    For OR like queries, that use Scorer.next(), deletions might be treated as early
    as possible, since each hit will cause a matching doc and a corresponding score calculation.

    For AND like queries, that use Scorer.skipTo(), deletions can be treated later,
    for example in skipTo() of the conjunction/and/scorer.

    In the same way, prefetching into a larger term documents buffer helps for OR queries,
    but gets in the way for AND queries.

  • Marvin Humphrey (JIRA) at Jan 21, 2009 at 6:00 pm
    [ https://issues.apache.org/jira/browse/LUCENE-1476?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12665894#action_12665894 ]

    Marvin Humphrey commented on LUCENE-1476:
    -----------------------------------------
{quote}
For OR-like queries, which use Scorer.next(), deletions might be treated as early as possible, since each hit will cause a matching doc and a corresponding score calculation.
{quote}
It depends whether the OR-like query spawns a Scorer that pre-fetches.

    At present, deletions are handled in Lucene at the SegmentTermDocs level.
    Filtering deletions that early minimizes overhead from *all* Scorer.next()
    implementations, including those that pre-fetch like the original
    BooleanScorer. The cost is that we must check every single posting in every
    single SegmentTermDocs to see if the doc has been deleted.
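
Rendered in the C idiom of this thread, that per-posting check looks something like the following. PList_Read_Raw and BitVec_Get are illustrative stand-ins for the posting decoder and the deletions bit vector; in Lucene proper the equivalent loop lives in SegmentTermDocs.next() against deletedDocs:

{code}
/* Sketch: the cost described above -- one deletions test per posting. */
i32_t
PList_Next_Filtered(PostingList *self, BitVector *deletions)
{
    while (1) {
        i32_t doc_num = PList_Read_Raw(self);      /* assumed: 0 = exhausted */
        if (doc_num == 0)                    { return 0; }
        if (!BitVec_Get(deletions, doc_num)) { return doc_num; }
        /* Deleted doc: the check was paid, the posting is discarded. */
    }
}
{code}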

The Scorer_Collect() routine above, which skips past deletions, consolidates
those per-posting checks into a single check at the top of the scoring loop.
However, the original BooleanScorer can't skip, so it will incur overhead from
scoring documents which have been deleted. Do the savings and the additional
costs balance each other out? It's hard to say.

However, with OR-like Scorers which implement skipping and don't pre-fetch --
Lucene's DisjunctionSumScorer and its KS analogue ORScorer -- you get the best
of both worlds: not only do you avoid the per-posting-per-SegmentTermDocs
deletions check, you also get to skip past deletions and avoid the extra
overhead in Scorer.next().

Of course, the problem is that because ScorerDocQueue isn't very efficient,
DisjunctionSumScorer and ORScorer are often slower than the distributed sort
in the original BooleanScorer.
{quote}
For AND-like queries, which use Scorer.skipTo(), deletions can be treated later, for example in skipTo() of the conjunction (AND) scorer.
{quote}
Yes, and it could be implemented by adding a Filter sub-clause to the
conjunction (AND) scorer.
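
One way to picture that Filter sub-clause, again as a hedged sketch: expose the complement of the deletions as a matcher, and let the conjunction scorer treat it as just another required clause. DelFilterMatcher and its fields are assumptions; DelEnum_Advance keeps the semantics used above:

{code}
/* Sketch: the complement of the deletions, wrapped as a matcher that a
 * conjunction scorer could AND in alongside the real clauses. */
i32_t
DelFilterMatcher_Advance(DelFilterMatcher *self, i32_t target)
{
    i32_t doc_num       = target;
    i32_t next_deletion = DelEnum_Advance(self->deletions, doc_num);
    while (doc_num == next_deletion) {     /* walk past contiguous deletions */
        doc_num++;
        next_deletion = DelEnum_Advance(self->deletions, doc_num);
    }
    /* self->max_docs is an assumed field; 0 signals exhaustion. */
    return doc_num <= self->max_docs ? doc_num : 0;
}
{code}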


