[ https://issues.apache.org/jira/browse/LUCENE-1476?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12661786#action_12661786 ]

Marvin Humphrey commented on LUCENE-1476:

While converting over PostingList (the Lucy/KS analogue to TermDocs) to use a deletions iterator, it occurred to me that the because the iterator has to keep state, a unique DelEnum object has to be created for every PostingList. In contrast, a BitVector object, which is accessed only via get(), can be shared.

It bugs me that each PostingList will have its own DelEnum performing interleaving of tombstones. With very large queries against indexes with large numbers of deletions, it seems like a lot of duplicated work.

To minimize CPU cycles, it would theoretically make more sense to handle deletions much higher up, at the top level Scorer, Searcher, or even the HitCollector level. PostingList would be completely ignorant of deletions, as would classes like NOTScorer and MatchAllScorer:

MatchAllScorer_next(MatchAllScorer* self)
if (++self->doc_num > self->max_docs) {
return 0;
return self->doc_num;

Scorer_collect(Scorer *self, Hitcollector *collector, DelEnum *del_enum)
i32_t next_deletion = del_enum ? 0 : I32_MAX;
i32_t doc_num = 1;
while (1) {
while (doc_num >= next_deletion) {
next_deletion = DelEnum_Skip_To(del_enum, target);
while (doc_num == next_deletion) {
next_deletion = DelEnum_Skip_To(del_enum, doc_num);
doc_num = Scorer_Skip_To(scorer, doc_num);
if (doc_num) {
HC_Collect(collector, doc_num, Scorer_Tally(scorer));
else {

The problem is that PostingLists spun off from indexReader.postingList(field, term) would include deleted documents, as would Scorers.

I suppose you could band-aid indexReader.postingList() by adding a boolean suppressDeletions argument which would default to true, but that seems messy.

Jason, I think the inefficiency of needing individual iterator objects applies to OpenBitSet as well, right? As soon as we do away with random access, we have to keep state. Dunno if it's going to be noticeable, but it's conceptually annoying.
BitVector implement DocIdSet

Key: LUCENE-1476
URL: https://issues.apache.org/jira/browse/LUCENE-1476
Project: Lucene - Java
Issue Type: Improvement
Components: Index
Affects Versions: 2.4
Reporter: Jason Rutherglen
Priority: Trivial
Attachments: LUCENE-1476.patch

Original Estimate: 12h
Remaining Estimate: 12h

BitVector can implement DocIdSet. This is for making SegmentReader.deletedDocs pluggable.
This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.

To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

Search Discussions

Discussion Posts

Follow ups

Related Discussions

Discussion Navigation
viewthread | post
posts ‹ prev | 1 of 91 | next ›
Discussion Overview
groupjava-dev @
postedJan 8, '09 at 12:01a
activeJan 30, '09 at 10:47p



site design / logo © 2021 Grokbase