Jason Rutherglen wrote:
"But I think for realtime we don't want to be using IW's deletion at
all. We should do all deletes via the IndexReader. In fact if IW has
handed out a reader (via getReader()) and that reader (or a reopened
derivative) remains open we may have to block deletions via IW. Not
Can't IW use the IR to do its deletions? Currently deletions in IW
are implemented in DocumentsWriter.applyDeletes by loading a segment
with SegmentReader.get() and making the deletions which causes term
index load overhead per flush. If IW has an internal IR then the
deletion process can use it (not SegmentReader.get) and there should
not be a conflict anymore between the IR and IW deletion processes.
Today, IW quickly opens each SegmentReader, applies deletes, then
commits & closes it, because we have considered it too costly to leave
these readers open.
But if you've opened a persistent IR via the IndexWriter anyway, we
should use the SegmentReaders from that IR instead.
It seems like the joint IR+IW would allow you to do adds, deletes,
setNorms, all of which are not visible in the exposed IR until
IR.reopen is called. reopen would then flush any added docs to new
segments, materialize any buffered deletes into the BitVectors (or
future transactional sorted int tree thingy), likewise for norms, and
then return a new IR.
Ie, the IR becomes transactional as well -- deletes are not visible
immediately; they appear only once reopen is called (unlike today when you delete via
IR). I think this means, internally when IW wants to make changes to
the shared IR, it should make a clone() and do the changes privately
to that instance. Then when reopen is called, we must internally
reopen that clone() such that its deleted docs are carried over to the
newly reopened reader and newly flushed docs from IW are visible as
And on reopen, the deletes should not be flushed to the Directory --
they only need to be "moved" into each SegmentReader's deletedDocs.
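
To make the clone-then-reopen idea concrete, here is a minimal sketch in plain Java. It is a toy model, not Lucene code: SegmentView, Writer and reopenWith are hypothetical names, java.util.BitSet stands in for the BitVector, and deletes are buffered by doc ID for brevity (IW actually buffers them as terms or queries). It only shows that buffered deletes stay invisible to an already handed-out reader and are "moved" into a private copy of deletedDocs on reopen, with nothing written to the Directory.

```java
import java.util.ArrayList;
import java.util.BitSet;
import java.util.List;

/** Toy model only: none of these classes are Lucene APIs. */
public class BufferedDeletesSketch {

    /** Immutable point-in-time view of one segment's deleted docs. */
    static final class SegmentView {
        private final int maxDoc;
        private final BitSet deletedDocs;

        SegmentView(int maxDoc, BitSet deletedDocs) {
            this.maxDoc = maxDoc;
            // Defensive copy so later changes never leak into this view.
            this.deletedDocs = (BitSet) deletedDocs.clone();
        }

        boolean isDeleted(int docId) {
            return deletedDocs.get(docId);
        }

        int numDocs() {
            return maxDoc - deletedDocs.cardinality();
        }

        /** Copy-on-write: returns a new view with the pending deletes applied in RAM. */
        SegmentView reopenWith(List<Integer> pendingDeletes) {
            BitSet copy = (BitSet) deletedDocs.clone();
            for (int docId : pendingDeletes) {
                copy.set(docId);
            }
            return new SegmentView(maxDoc, copy);
        }
    }

    /** Stand-in for the writer side: buffers deletes until reopen is called. */
    static final class Writer {
        private SegmentView current;
        private final List<Integer> buffered = new ArrayList<>();

        Writer(int maxDoc) {
            current = new SegmentView(maxDoc, new BitSet(maxDoc));
        }

        synchronized void deleteDocument(int docId) {
            buffered.add(docId); // not visible to any reader yet
        }

        synchronized SegmentView getReader() {
            return current;
        }

        /** Materializes buffered deletes into a fresh view; older views are untouched. */
        synchronized SegmentView reopen() {
            current = current.reopenWith(buffered);
            buffered.clear();
            return current;
        }
    }

    public static void main(String[] args) {
        Writer writer = new Writer(5);
        SegmentView before = writer.getReader();
        writer.deleteDocument(2);
        System.out.println(before.isDeleted(2)); // false: delete is only buffered
        SegmentView after = writer.reopen();
        System.out.println(after.isDeleted(2));  // true: materialized on reopen
        System.out.println(before.isDeleted(2)); // false: old reader is unchanged
    }
}
```

Copying the whole bit set on every reopen is the simplest way to get the isolation; a real implementation would presumably want something cheaper for large segments.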
We'd also need to ensure when a merge kicks off, the SegmentReaders
used by the merging are not newly reopened but also "borrowed" from
the already open IR. This could actually mean that some deleted docs
get merged away before the deletions ever get flushed to the Directory.
"we may have to block deletions via IW"
Hopefully they can be buffered.
Where else does the write lock need to be coordinated between IR and
"somehow IW & IR have to "split" the write lock else we may
need to merge deletions somehow."
This is a part I'd like to settle on before start of
implementation. It looks like in IW deletes are buffered as terms
or queries until flushed. I don't think there needs to be a lock
until the flush is performed?
For the merge changes to the index, the deletionpolicy can be used
to ensure a reader still has access to the segments it needs from
the main directory.
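
One way to picture what the deletion policy has to guarantee is reference counting on segments. The sketch below is an assumption for illustration only: SegmentRefCountSketch, incRef, decRef and isReferenced are hypothetical names, not Lucene's deletion policy API. It shows that a segment replaced by a merge stays deletable only after the last reader holding it has let go.

```java
import java.util.HashMap;
import java.util.Map;

/** Toy reference counter for segment names; not a Lucene API. */
public class SegmentRefCountSketch {
    private final Map<String, Integer> refCounts = new HashMap<>();

    /** Called when a reader (or the writer) starts using a segment. */
    synchronized void incRef(String segment) {
        refCounts.merge(segment, 1, Integer::sum);
    }

    /** Returns true if the segment became unreferenced and its files may be deleted. */
    synchronized boolean decRef(String segment) {
        Integer count = refCounts.get(segment);
        if (count == null) {
            throw new IllegalStateException("unknown segment: " + segment);
        }
        if (count == 1) {
            refCounts.remove(segment);
            return true;
        }
        refCounts.put(segment, count - 1);
        return false;
    }

    synchronized boolean isReferenced(String segment) {
        return refCounts.containsKey(segment);
    }

    public static void main(String[] args) {
        SegmentRefCountSketch refs = new SegmentRefCountSketch();
        refs.incRef("_0"); // held by the writer
        refs.incRef("_0"); // held by an open reader
        System.out.println(refs.decRef("_0")); // false: writer merged it away, reader still needs it
        System.out.println(refs.decRef("_0")); // true: reader closed, files can go
    }
}
```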
The write lock is held to prevent multiple writers from buffering and
then writing changes to the index. Since we will have this joint
IR/IW share state, as long as we properly synchronize/share things
between IR/IW, it's fine if they both "share" the write lock.
It seems like IR.reopen suddenly means "have IW materialize all
pending stuff and give me a new reader", where stuff is adds &
deletes. Adds must materialize via the directory. Deletes can
materialize entirely in RAM. Likewise for norms.
When IW.commit is called, it also then asks each SegmentReader to
commit. Ie, IR.commit would not be used.
"We have to test performance to measure the net add -> search
For many apps this approach may be plenty fast. If your IO system is
an SSD it could be extremely fast. Swapping in RAMDir
just makes it faster w/o changing the basic approach."
It is true that this is the best way to start and in fact may be good
enough for many users. It could help new users to expose a reader
from IW so the delineation between them is removed and Lucene
becomes easier to use.
At the very least this system allows a concurrently updateable IR and
IW by sharing the write lock, something that is currently
incorrect in Lucene.
I wouldn't call it "incorrect". It was an explicit design tradeoff to
make the division between IR & IW, and done for many good reasons. We
are now talking about relaxing that and it clearly raises a number of
"Besides the transaction log (for crash recovery), which should fit
"above" Lucene nicely, what else is needed for realtime beyond the
single-transaction support Lucene already provides?"
What we have described above (exposing IR via IW) will be sufficient
and realtime will live above it.
In this model, the combined IR+IW is still jointly transactional, in
that the IW's commit() method still behaves as it does today. It's just
that the IR that's linked to the IW is allowed to "see" changes, shared
only in RAM, that a freshly opened IR on the index would not see until
commit has been called.
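
A small sketch of that visibility rule, again as a toy model rather than Lucene code: JointVisibilitySketch and its methods are hypothetical names, and plain lists of strings stand in for documents, flushed segments and the last commit point. The linked reader sees whatever has been materialized by reopen, while a freshly opened reader sees only what commit() has published.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

/** Toy model of joint IR+IW visibility; not a Lucene API. */
public class JointVisibilitySketch {
    // What a freshly opened reader on the index would see (last commit point).
    private final List<String> committed = new ArrayList<>();
    // What the reader linked to the writer sees after a reopen.
    private final List<String> flushed = new ArrayList<>();
    // Adds buffered in the writer, visible to nobody yet.
    private final List<String> buffered = new ArrayList<>();

    synchronized void addDocument(String doc) {
        buffered.add(doc);
    }

    /** Linked reader: materialize pending adds and return a point-in-time snapshot. */
    synchronized List<String> reopenLinkedReader() {
        flushed.addAll(buffered);
        buffered.clear();
        return Collections.unmodifiableList(new ArrayList<>(flushed));
    }

    /** Commit behaves as today: it publishes everything to fresh readers. */
    synchronized void commit() {
        flushed.addAll(buffered);
        buffered.clear();
        committed.clear();
        committed.addAll(flushed);
    }

    /** Fresh reader opened directly on the index: sees only committed state. */
    synchronized List<String> openFreshReader() {
        return Collections.unmodifiableList(new ArrayList<>(committed));
    }

    public static void main(String[] args) {
        JointVisibilitySketch index = new JointVisibilitySketch();
        index.addDocument("doc1");
        System.out.println(index.reopenLinkedReader()); // [doc1]
        System.out.println(index.openFreshReader());    // []
        index.commit();
        System.out.println(index.openFreshReader());    // [doc1]
    }
}
```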