
Andrzej Bialecki wrote:
No matter whether you are right or wrong, please keep a civil tone on
this public forum.
+1 Ad-hominem remarks are anti-community.

Doug



  • Doug Cutting at Jan 5, 2009 at 7:19 pm

    Robert Engels wrote:
    Do what you like. You obviously will. This is the problem with the Lucene managers - the problems are only the ones they see - same with the solutions. If the solution (or questions) put them outside their comfort zone, they are ignored or dismissed in a tone that is designed to limit any further questions (especially those that might question their ability and/or understanding).
    There are no "Lucene managers". We are a collaborative community. As
    with any community, all are not equally informed in all matters, and
    some may not realize they are uninformed. Consensus building is an art.
    One cannot simply assert that one is correct. One must rather
    convince others. Offending them is not a good start. Polite
    persistence, illustrative examples and patches are often successful.

    Doug

  • Robert Engels at Jan 5, 2009 at 7:29 pm
    Then your comments are misdirected.
  • Jason Rutherglen at Jan 5, 2009 at 8:55 pm
    +1 Agreed, the initial version should use RAMDirectory in order to keep
    things simple and to benchmark against other MemoryIndex-like index
    representations.
    On Fri, Dec 26, 2008 at 10:20 AM, Doug Cutting wrote:

    Michael McCandless wrote:
    So then I think we should start with approach #2 (build real-time on
    top of the Lucene core) and iterate from there. Newly added docs go
    into tiny segments, which IndexReader.reopen pulls in. Replaced or
    deleted docs record the delete against the right SegmentReader (and
    LUCENE-1314 lets reopen carry those pending deletes forward, in RAM).

    I would take the simple approach first: use ordinary SegmentReader on
    a RAMDirectory for the tiny segments. If that proves too slow, swap
    in Memory/InstantiatedIndex for the tiny segments. If that proves too
    slow, build a reader impl that reads from DocumentsWriter RAM buffer.
    +1 This sounds like a good approach to me. I don't see any fundamental
    reasons why we need different representations, and fewer implementations of
    IndexWriter and IndexReader are generally better, unless they get way too
    hairy. Mostly it seems that real-time can be done with our existing toolbox
    of data structures, but with some slightly different control structures.
    Once we have the control structure in place then we should look at
    optimizing data structures as needed.

    Doug
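    A minimal sketch of this "ordinary SegmentReader on a RAMDirectory"
    approach, assuming the Lucene 2.4-era API (the directory wiring and the
    fold-in cadence below are illustrative assumptions, not part of the
    proposal):

      import java.io.IOException;
      import org.apache.lucene.analysis.standard.StandardAnalyzer;
      import org.apache.lucene.document.Document;
      import org.apache.lucene.document.Field;
      import org.apache.lucene.index.IndexReader;
      import org.apache.lucene.index.IndexWriter;
      import org.apache.lucene.store.Directory;
      import org.apache.lucene.store.RAMDirectory;

      public class TinySegmentsSketch {
        public static void main(String[] args) throws IOException {
          // Tiny segments accumulate in a RAMDirectory, searchable via an
          // ordinary reader, and are periodically folded into the main index.
          Directory mainDir = new RAMDirectory(); // stand-in for the on-disk index
          RAMDirectory ramDir = new RAMDirectory();

          IndexWriter ramWriter = new IndexWriter(ramDir, new StandardAnalyzer(),
              IndexWriter.MaxFieldLength.UNLIMITED);
          Document doc = new Document();
          doc.add(new Field("body", "a freshly added document",
              Field.Store.NO, Field.Index.ANALYZED));
          ramWriter.addDocument(doc);
          ramWriter.commit();                 // the tiny segment is now searchable
          IndexReader ramReader = IndexReader.open(ramDir);

          // Periodically, the main writer absorbs the tiny segments:
          IndexWriter mainWriter = new IndexWriter(mainDir, new StandardAnalyzer(),
              IndexWriter.MaxFieldLength.UNLIMITED);
          mainWriter.addIndexesNoOptimize(new Directory[] { ramDir });
          mainWriter.commit();
          ramReader.close();
        }
      }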


  • Jason Rutherglen at Jan 9, 2009 at 3:25 am
    Based on our discussions, it seems best to get realtime search going in
    small steps. Below are some possible steps to take.

    Patch #1: Expose an IndexWriter.getReader method that returns the current
    reader and shares the write lock
    Patch #2: Implement a realtime ram index class
    Patch #3: Implement realtime transactions in IndexWriter or in a subclass of
    IndexWriter by implementing a createTransaction method that generates a
    realtime Transaction object. When the transaction is flushed, the
    transaction index modifications are available via the getReader method of
    IndexWriter

    The remaining question is how to synchronize the flushes to disk with
    IndexWriter's other index-update locking mechanisms. The flushing could
    simply use IW.addIndexes, which already has a locking mechanism in place.
    After flushing to disk, queued deletes would be applied to the newly copied
    disk segments. I think this entails opening the newly copied disk segments
    and applying the deletes that occurred against the corresponding ram
    segments, by cloning the new disk segments, replacing the deleted-docs
    BitVector, and then flushing the deleted docs to disk. This system would
    allow us to avoid using a UID in documents.

    The API needs to clearly separate realtime transactions from the existing
    index-update methods such as addDocument, deleteDocuments, and
    updateDocument. I don't think it's possible to implement both transparently,
    because the underlying implementations behave differently. It is expected
    that multiple transactions may be created at once; however, the
    Transaction.flush method would block.
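    Restated as code, the proposed surface might look as follows. Note that
    Transaction, createTransaction, and getReader are all hypothetical names
    from the steps above -- none of them exist in IndexWriter today -- so this
    is a shape, not working code:

      // Patch #3 (hypothetical): group updates into a realtime transaction.
      Transaction tx = writer.createTransaction();
      tx.addDocument(doc);                        // goes into a realtime ram index
      tx.deleteDocuments(new Term("id", "42"));
      tx.flush();                                 // blocks; modifications become
                                                  // visible to subsequent readers

      // Patch #1 (hypothetical): a reader that shares IW's write lock and sees
      // the flushed transaction without a commit to the main directory.
      IndexReader r = writer.getReader();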
  • Michael McCandless at Jan 9, 2009 at 1:40 pm

    Jason Rutherglen wrote:

    Patch #1: Expose an IndexWriter.getReader method that returns the
    current reader and shares the write lock
    I tentatively like this approach so far...

    That reader is opened using IndexWriter's SegmentInfos instance, so it
    can read segments & deletions that have been flushed but not
    committed. It's allowed to do its own deletions & norms updating.
    When reopen() is called, it grabs the writers SegmentInfos again.
    Patch #2: Implement a realtime ram index class
    I think this one is optional, or, rather an optimization that we can
    swap in later if/when necessary? Ie for starters little segments are
    written into the main Directory.
    Patch #3: Implement realtime transactions in IndexWriter or in a
    subclass of IndexWriter by implementing a createTransaction method
    that generates a realtime Transaction object. When the transaction
    is flushed, the transaction index modifications are available via
    the getReader method of IndexWriter
    Can't this be layered on top?

    Or... are you looking to add support for multiple transactions in
    flight at once on IndexWriter?

    Mike
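    As a usage sketch, the pattern tentatively liked here would be roughly the
    following (this is the shape that later shipped in Lucene 2.9 as
    IndexWriter.getReader, i.e. near-real-time search; at the time of this
    thread it is still a proposal):

      IndexReader r = writer.getReader(); // sees flushed but uncommitted segments
      writer.addDocument(newDoc);         // not yet visible to r

      IndexReader r2 = r.reopen();        // grabs the writer's SegmentInfos again
      if (r2 != r) {
        r.close();
        r = r2;                           // now sees the newly flushed segments
      }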

  • Jason Rutherglen at Jan 9, 2009 at 5:03 pm
    M.M.: "That reader is opened using IndexWriter's SegmentInfos instance, so
    it
    can read segments & deletions that have been flushed but not
    committed. It's allowed to do its own deletions & norms updating.
    When reopen() is called, it grabs the writers SegmentInfos again."

    Are you referring to the IW.pendingCommit SegmentInfos variable? When you
    say "flushed" you are referring to the IW.prepareCommit method?

    I think step #1 is important and should be generally useful outside of
    realtime search; however, it's unclear how/when calls to IW.deleteDocument
    will be reflected in IW.getReader. I assumed that IW.commit would result in
    IW.deleteDocument changes showing up in IW.getReader, while calls to
    Transaction.deleteDocument/flush would show up immediately; otherwise the
    semantics of realtime indexing vs. IW-based batch indexing use cases are
    unclear to the user. With IW indexing, one adds documents and deletes
    documents, then does a global commit to the main directory. Interleaving
    deletes with added documents isn't possible because if the documents are
    still in the IW ram buffer, they are not necessarily deleted. So if the
    semantics are such that IW.commit or IW.prepareCommit exposes deletes via
    IW.getReader, what is the difference compared to IndexReader.reopen on the
    index, except the shared write lock? Ok, perhaps this is all one gets, and
    as you mentioned the rest is placed on a level above IW, which hopefully
    does not confuse the user.

    M.M.: "
    Patch #2: Implement a realtime ram index class
    I think this one is optional, or, rather an optimazation that we can
    swap in later if/when necessary? Ie for starters little segments are
    written into the main Directory."

    If this is swapped in later, how is the system realtime, except perhaps
    for deletes?

    M.M.: "Can't this be layered on top?
    Or... are you looking to add support for multiple transactions in
    flight at once on IndexWriter?"

    The initial version can be layered on top; that will make testing easier.
    Adding support for multiple transactions at once on IndexWriter, outside of
    the realtime transactions, seems to require a lot of refactoring.

  • Michael McCandless at Jan 9, 2009 at 7:15 pm

    Jason Rutherglen wrote:

    Are you referring to the IW.pendingCommit SegmentInfos variable?
    No, I'm referring to segmentInfos. (pendingCommit is the "snapshot"
    of segmentInfos taken when committing...).
    When you say "flushed" you are referring to the IW.prepareCommit method?
    No, I'm referring to "flush"... it writes a new segment but not a new
    segments_N, does not sync the files, and does not invoke the deletion
    policy.
    I think step #1 is important and should be generally useful outside of realtime search, however it's unclear how/when calls to IW.deleteDocument will reflect in IW.getReader?
    You'd have to flush (to materialize pending deletions inside IW) then
    reopen the reader, to see any deletions done via the writer. But I
    think instead realtime search would do deletions via the reader
    (because if you use IW you're updating deletes through the Directory =
    too slow).
    Interleaving deletes with documents added isn't possible because if the documents are in the IW ram buffer, they are not necessarily deleted
    Well, we buffer the delete and then on flush we materialize the
    delete. So if you add a doc with field X=77, then delete-by-term
    X:77, then flush, you'll flush a 1 document segment whose only
    document is marked as deleted.

    But I think for realtime we don't want to be using IW's deletion at
    all. We should do all deletes via the IndexReader. In fact if IW has
    handed out a reader (via getReader()) and that reader (or a reopened
    derivative) remains open we may have to block deletions via IW. Not
    sure... somehow IW & IR have to "split" the write lock else we may
    need to merge deletions somehow.
    If this is swapped in later how is the system realtime except perhaps deletes?
    We have to test performance to measure the net add -> search latency.
    For many apps this approach may be plenty fast. If your IO system is
    an SSD it could be extremely fast. Swapping in RAMDir
    just makes it faster w/o changing the basic approach.
    Adding support for multiple transactions at once on IndexWriter outside of the realtime transactions seems to require a lot of refactoring.
    Besides the transaction log (for crash recovery), which should fit
    "above" Lucene nicely, what else is needed for realtime beyond the
    single-transaction support Lucene already provides?

    Mike
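    The X:77 example above, spelled out against the Lucene 2.4-era API (writer
    setup and imports assumed):

      Document doc = new Document();
      doc.add(new Field("X", "77", Field.Store.NO, Field.Index.NOT_ANALYZED));
      writer.addDocument(doc);                     // buffered in the RAM buffer
      writer.deleteDocuments(new Term("X", "77")); // the delete is buffered too

      writer.commit(); // flushes and materializes the buffered delete: the index
                       // now has a one-document segment whose only doc is deleted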

  • Jason Rutherglen at Jan 9, 2009 at 8:19 pm
    "But I think for realtime we don't want to be using IW's deletion at
    all. We should do all deletes via the IndexReader. In fact if IW has
    handed out a reader (via getReader()) and that reader (or a reopened
    derivative) remains open we may have to block deletions via IW. Not
    sure..."

    Can't IW use the IR to do its deletions? Currently deletions in IW are
    implemented in DocumentsWriter.applyDeletes by loading a segment with
    SegmentReader.get() and making the deletions, which incurs term-index load
    overhead per flush. If IW has an internal IR then the deletion process can
    use it (not SegmentReader.get) and there should no longer be a conflict
    between the IR and IW deletion processes.
    "we may have to block deletions via IW"
    Hopefully they can be buffered.

    Where else does the write lock need to be coordinated between IR and IW?
    "somehow IW & IR have to "split" the write lock else we may
    need to merge deletions somehow."

    This is a part I'd like to settle on before starting implementation. It
    looks like in IW, deletes are buffered as terms or queries until flushed. I
    don't think there needs to be a lock until the flush is performed?

    For the merge changes to the index, the DeletionPolicy can be used to ensure
    a reader still has access to the segments it needs from the main directory.

    "We have to test performance to measure the net add -> search latency.
    For many apps this approach may be plenty fast. If your IO system is
    an SSD it could be extremely fast. Swapping in RAMDir
    just makes it faster w/o changing the basic approach."

    It is true that this is the best way to start, and in fact it may be good
    enough for many users. It could help new users to expose a reader from IW,
    so the delineation between them is removed and Lucene becomes easier to use.

    At the very least this system allows a concurrently updateable IR and IW by
    sharing the write lock, something that is currently incorrect in Lucene.
    "Besides the transaction log (for crash recovery), which should fit
    "above" Lucene nicely, what else is needed for realtime beyond the
    single-transaction support Lucene already provides?"

    What we have described above (exposing IR via IW) will be sufficient and
    realtime will live above it.


  • Jason Rutherglen at Jan 9, 2009 at 11:43 pm
    I think the IW-integrated IR needs a rule regarding the behavior of
    IW.flush and IR.flush. There will need to be a flush lock that is
    shared between the IW and IR. The lock is acquired at the beginning
    of a flush and released immediately after a successful or
    unsuccessful call. We will need to share this lock down to the
    SegmentReader level, as presumably IR.getSequentialReaders may be
    called and the subreaders operated on individually.
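    A hypothetical sketch of such a shared flush lock -- none of these names
    exist in Lucene; it only illustrates the acquire/release discipline just
    described:

      // One monitor shared by IW, the realtime IR, and its SegmentReaders.
      final Object flushLock = new Object();

      void writerFlush() throws IOException {
        synchronized (flushLock) {  // acquired at the beginning of the flush...
          // write the new segment, materialize buffered deletes, etc.
        }                           // ...released after success or failure
      }

      void readerFlushDeletes() throws IOException {
        synchronized (flushLock) {
          // write this reader's pending .del files
        }
      }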

    A few questions need to be answered as to the desired behavior. What
    happens when IW flushes with deletes and IR has pending deletes not yet
    flushed? Can we automatically flush the IR deletes? If they are not
    automatically flushed, are the IR deletes still valid, and can the IR
    later flush them without creating a conflict (I think this is doable)?
    Or does the reader become read-only, so that IR.reopen must be called
    to obtain the new deletes? In the reverse scenario, where IW has
    pending deletes and IR flushes deletes, are there issues that arise
    when IW later flushes? I think if the implications of using IR and IW
    in combination for deletes are made clear to the user, then there
    should not be an issue with supporting deletes from both IR and IW.

    (I found another way to format with hard line breaks
    http://emailformattool.com/)
  • Michael McCandless at Jan 24, 2009 at 12:29 pm

    Jason Rutherglen wrote:

    "But I think for realtime we don't want to be using IW's deletion at
    all. We should do all deletes via the IndexReader. In fact if IW has
    handed out a reader (via getReader()) and that reader (or a reopened
    derivative) remains open we may have to block deletions via IW. Not
    sure..."

    Can't IW use the IR to do its deletions? Currently deletions in IW
    are implemented in DocumentsWriter.applyDeletes by loading a segment
    with SegmentReader.get() and making the deletions, which incurs
    term-index load overhead per flush. If IW has an internal IR then the
    deletion process can use it (not SegmentReader.get) and there should
    no longer be a conflict between the IR and IW deletion processes.
    Today, IW quickly opens each SegmentReader, applies deletes, then
    commits & closes it, because we have considered it too costly to leave
    these readers open.

    But if you've opened a persistent IR via the IndexWriter anyway, we
    should use the SegmentReaders from that IR instead.

    It seems like the joint IR+IW would allow you to do adds, deletes,
    setNorms, all of which are not visible in the exposed IR until
    IR.reopen is called. reopen would then flush any added docs to new
    segments, materialize any buffered deletes into the BitVectors (or
    future transactional sorted int tree thingy), likewise for norms, and
    then return a new IR.

    Ie, the IR becomes transactional as well -- deletes are not visible
    immediately until reopen is called (unlike today when you delete via
    IR). I think this means, internally when IW wants to make changes to
    the shared IR, it should make a clone() and do the changes privately
    to that instance. Then when reopen is called, we must internally
    reopen that clone() such that its deleted docs are carried over to the
    newly reopened reader and newly flushed docs from IW are visible as
    new SegmentReaders.

    And on reopen, the deletes should not be flushed to the Directory --
    they only need to be "moved" into each SegmentReader's deletedDocs.
    We'd also need to ensure when a merge kicks off, the SegmentReaders
    used by the merging are not newly reopened but also "borrowed" from
    the already open IR. This could actually mean that some deleted docs
    get merged away before the deletions ever get flushed to the Directory.
    "we may have to block deletions via IW"
    Hopefully they can be buffered.

    Where else does the write lock need to be coordinated between IR and
    IW?
    "somehow IW & IR have to "split" the write lock else we may
    need to merge deletions somehow."

    This is a part I'd like to settle on before starting
    implementation. It looks like in IW, deletes are buffered as terms
    or queries until flushed. I don't think there needs to be a lock
    until the flush is performed?

    For the merge changes to the index, the DeletionPolicy can be used
    to ensure a reader still has access to the segments it needs from
    the main directory.
    The write lock is held to prevent multiple writers from buffering and
    then writing changes to the index. Since this joint IR/IW will share
    state, as long as we properly synchronize/share things
    between IR/IW, it's fine if they both "share" the write lock.

    It seems like IR.reopen suddenly means "have IW materialize all
    pending stuff and give me a new reader", where stuff is adds &
    deletes. Adds must materialize via the directory. Deletes can
    materialize entirely in RAM. Likewise for norms.

    When IW.commit is called, it also then asks each SegmentReader to
    commit. Ie, IR.commit would not be used.
    "We have to test performance to measure the net add -> search
    latency.
    For many apps this approach may be plenty fast. If your IO system is
    an SSD it could be extremely fast. Swapping in RAMDir
    just makes it faster w/o changing the basic approach."

    It is true that this is the best way to start, and in fact it may be
    good enough for many users. It could help new users to expose a
    reader from IW, so the delineation between them is removed and Lucene
    becomes easier to use.

    At the very least this system allows a concurrently updateable IR
    and IW by sharing the write lock, something that is currently
    incorrect in Lucene.
    I wouldn't call it "incorrect". It was an explicit design tradeoff to
    make the division between IR & IW, and done for many good reasons. We
    are now talking about relaxing that and it clearly raises a number of
    "challenging" issues...
    "Besides the transaction log (for crash recovery), which should fit
    "above" Lucene nicely, what else is needed for realtime beyond the
    single-transaction support Lucene already provides?"

    What we have described above (exposing IR via IW) will be sufficient
    and realtime will live above it.
    OK, good.

    In this model, the combined IR+IW is still jointly transactional, in
    that the IW's commit() method still behaves as it does today. It's just
    that the IR that's linked to the IW is allowed to "see" changes, shared
    only in RAM, that a freshly opened IR on the index would not see until
    commit has been called.

    Mike
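    A pseudo-flow of the reopen semantics sketched above; everything here
    except IndexReader.reopen is a hypothetical name for the steps described:

      IndexReader reopenRealtime(IndexWriter w, IndexReader current)
          throws IOException {
        synchronized (w) {
          w.flushAddedDocs();               // hypothetical: adds materialize via
                                            // the Directory as flushed segments
          w.moveBufferedDeletesToReaders(); // hypothetical: deletes/norms
                                            // materialize in RAM only, into
                                            // cloned SegmentReaders' BitVectors
          return current.reopen();          // deletions carried over; newly
                                            // flushed segments become visible
        }
      }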

  • Jason Rutherglen at Jan 30, 2009 at 1:17 am
    We'd also need to ensure when a merge kicks off, the SegmentReaders
    used by the merging are not newly reopened but also "borrowed" from
    the already open IR.

    The IW merge code currently opens the SegmentReader with a 4096
    buffer size (different from the 1024 default); how will this case be
    handled?
    reopen would then flush any added docs to new segments
    IR.reopen would call IW.flush?
    When IW.commit is called, it also then asks each SegmentReader to
    commit. Ie, IR.commit would not be used.

    Why is this? SegmentReader.commitChanges would be called instead?
    Then when reopen is called, we must internally reopen that clone()
    such that its deleted docs are carried over to the newly reopened
    reader and newly flushed docs from IW are visible as new
    SegmentReaders.

    If deletes are made to the external reader (meaning the one obtained
    by IW.getReader), then more deletes are made via IW.deleteDocument,
    and then reopen is called, what happens? Will we need to merge the
    deleted docs from the internal clone into the newly reopened reader?
    the IR becomes transactional as well -- deletes are not visible
    immediately until reopen is called

    Interesting. I'd rather somehow merge the IW and external reader's
    deletes; otherwise it seems like we're radically changing how IR
    works. Perhaps the IW keeps a copy of the external IR that has the
    write lock (thinking of IR.clone, where the write lock is passed onto
    the latest clone). This way IW.getReader is about the same as
    reopen/clone (because it will call reopen on, presumably, the latest
    IR).



  • Michael McCandless at Jan 30, 2009 at 2:05 pm

    Jason Rutherglen wrote:

    We'd also need to ensure when a merge kicks off, the SegmentReaders
    used by the merging are not newly reopened but also "borrowed" from
    the already open IR.
    The IW merge code currently opens the SegmentReader with a 4096
    buffer size (different from the 1024 default); how will this case be
    handled?
    I think we'd just use 1024 when merging.
    reopen would then flush any added docs to new segments
    IR.reopen would call IW.flush?
    I think it has to? (Whether "it" is IR.reopen, or a class that sits
    on top of both IR & IW, I'm not sure).

    Ie the interface would be: you add/delete/updateDoc and setNorm a bunch of
    times, during which none of these changes are visible to your
    currently open reader, followed by "reopen" to get a reader that then
    sees those changes?

    (This is all still brainstorming at this point of course....)
    When IW.commit is called, it also then asks each SegmentReader to
    commit. Ie, IR.commit would not be used.
    Why is this? SegmentReader.commitChanges would be called instead?
    Because IR.commit is doing other stuff (invoking deletion policy,
    syncing newly referenced files, writing new segments file, rollback
    logic on hitting an exception, etc.) that overlaps what IW.commit also
    does. It'd be great to factor this common stuff out so IW and IR
    would share a single source. (Yes, SR.commitChanges would be called
    directly, I think).
    Then when reopen is called, we must internally reopen that clone()
    such that its deleted docs are carried over to the newly reopened
    reader and newly flushed docs from IW are visible as new
    SegmentReaders.
    If deletes are made to the external reader (meaning the one obtained
    by IW.getReader), then deletes are made via IW.deleteDocument, then
    reopen is called, what happens in this case? We will need to merge
    the del docs from the internal clone into the newly reopened reader?
    I guess we could merge them. Ie, deletes made through the reader (by
    docID) are immediately visible, but those made through the writer are
    buffered until a flush or reopen?

    Still, I don't like exposing two ways to do deletions, with two
    different behaviours (buffered or not). It's weird. Maybe, instead,
    all deletes done via IW would be immediate?

    It seems like either 1) all deletes are buffered until reopen, or 2)
    all deletes are immediately materialized. I think half/half is too
    strange.
    the IR becomes transactional as well -- deletes are not visible
    immediately until reopen is called
    Interesting. I'd rather somehow merge the IW and external reader's
    deletes, otherwise it seems like we're radically changing how IR
    works. Perhaps the IW keeps a copy of the external IR that has the
    write lock (thinking of IR.clone where the write lock is passed onto
    the latest clone). This way IW.getReader is about the same as
    reopen/clone (because it will call reopen on presumably the latest
    IR).
    We'd only be "radically changing" how the RealTimeReader works.

    I think the initial approach here might be to simply open up enough
    package-private APIs or subclass-ability on IR and IW so that we can
    experiment with these realtime ideas. Then we iterate w/ different
    experiments to see how things flesh out...

    Actually could you redo LUCENE-1516 now that LUCENE-1314 is in?

    Mike

  • Jason Rutherglen at Jan 31, 2009 at 12:24 am
    deletes made through the reader (by docID) are immediately visible, but
    those made through the writer are buffered until a flush or reopen?

    This is what I was thinking: IW buffers deletes, IR does not. Making
    IW deletes visible immediately by applying them to the IR makes sense
    as well.

    What should be the behavior of IW.updateDocument?

    LUCENE-1314 is in and we've agreed IR.reopen causes an IW.flush,
    so I'll continue the LUCENE-1516 patch.
  • Grant Ingersoll at Jan 9, 2009 at 7:31 pm

    Minor design nit...
    We've spent a lot of time up until now getting write functionality out
    of the Reader, and now we are going to add read functionality into the
    Writer? Is that the right thing to do? Perhaps there is an interface
    or some shared objects to be used/exposed, or maybe people should get
    Readers/Writers from a factory, and you could have an RT Factory and a
    default Factory? Not trying to distract from the deeper issues here,
    but I don't think it makes sense to have the Writer coupled to the
    Reader.

  • Michael McCandless at Jan 9, 2009 at 7:54 pm

    Grant Ingersoll wrote:

    We've spent a lot of time up until now getting write functionality
    out of the Reader, and now we are going to add read functionality
    into the Writer?
    Well... we're not really adding read functionality into IW; instead,
    we are asking IW to open the reader for us, except the reader is
    provided the SegmentInfos it should use from IW (instead of trying to
    find the latest segments_N file in the Directory).

    Ie, what IW.getReader returns is an otherwise normal
    MultiSegmentReader.

    The goal is to allow an IndexReader to access segments flushed but
    not yet committed by IW. These segments are normally "private" to IW,
    in memory in its SegmentInfos instance.

    And this is all just thinking-out-loud-brainstorming. There are still
    many details to work through...

    Mike


  • Grant Ingersoll at Jan 9, 2009 at 8:05 pm
    I realize we aren't adding read functionality to the Writer, but it
    would be coupling the Writer to the Reader nonetheless. I understand
    it is brainstorming (like I said, not trying to distract from the
    discussion), just saying that if the Reader and the Writer both need
    access to the underlying data structures, then we should refactor to
    make that possible, not just glom the Reader onto the Writer. I
    suspect if that is done, anyway, that it may make the bigger picture a
    bit clearer, too.
  • Jason Rutherglen at Jan 12, 2009 at 5:56 pm
    Grant,

    Do you have a proposal in mind? It would help if you suggested some
    classes and methods, to make it easier to understand an alternative to
    what is being discussed.

    -J
  • Grant Ingersoll at Jan 12, 2009 at 8:23 pm
    Just thinking out loud... haven't looked at your patch yet (one of
    these days I will be back up for air)

    My initial thought is that you would have a factory that produced both
    the Reader and the Writer as a pair, or was at least aware of what to
    go get from the Writer.

    Something like:

    interface IndexFactory {
      IndexWriter getWriter();

      IndexReader getReader();

      // Not sure if this is needed yet, but????
      IndexReader getReader(IndexWriter writer);
    }

    The factory (or whatever you want to call it) is responsible for
    making sure the Writer and Reader have the pieces they need, i.e. the
    SegmentInfos.

    The first getReader will get you the plain old Reader that everyone
    knows and loves today (assuming there is a benefit to keeping it
    around); the second one knows what to get off the Writer to create the
    appropriate Reader.

    It's nothing particularly hard to implement over what you are
    proposing, I don't think. Just trying to keep the Reader out of the
    Writer from an API cleanliness standpoint.
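    Usage might then look like the following, where RealTimeIndexFactory is an
    invented name for whatever implementation would back the factory:

      IndexFactory factory = new RealTimeIndexFactory(dir); // hypothetical impl
      IndexWriter writer = factory.getWriter();
      IndexReader plain = factory.getReader();          // committed-index view
      IndexReader realtime = factory.getReader(writer); // view sharing the
                                                        // writer's SegmentInfos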

    -Grant

    --------------------------
    Grant Ingersoll

    Lucene Helpful Hints:
    http://wiki.apache.org/lucene-java/BasicsOfPerformance
    http://wiki.apache.org/lucene-java/LuceneFAQ
  • Jason Rutherglen at Jan 12, 2009 at 5:53 pm
    "Patch #2: Implement a realtime ram index class I think this one is
    optional, or, rather an optimazation that we can swap in later
    if/when necessary? Ie for starters little segments are written into
    the main Directory."

    John, Zoie could be of use for this patch. In addition, we may want to
    implement flushing the IW ram buffer to a RAMDir for reading, as M.M.
    suggested.

    First, though, the IW -> IR integration (LUCENE-1516) needs to be
    implemented; otherwise it's not possible to properly execute updates
    in realtime.

  • John Wang at Jan 9, 2009 at 4:42 am
    We have worked on this problem on the server level as well. We have also
    open sourced it at:

    http://code.google.com/p/zoie/

    wiki on the realtime aspect:

    http://code.google.com/p/zoie/wiki/ZoieSystem

    -John
    On Fri, Dec 26, 2008 at 12:34 PM, Robert Engels wrote:

    If you move to the "either embedded, or server model", the post reopen is
    trivial, as the structures can be created as the segment is written.

    It is the networked shared access model that causes a lot of these
    optimizations to be far more complex than needed.

    Would it maybe be simpler to move to the "embedded or server" model, and add
    a network shared file (e.g. nfs) access model as a layer? The latter is going
    to perform far worse anyway.

    I guess I don't understand why Lucene continues to try and support this
    model. NO ONE does it any more. This is the way MS Access worked, and
    everyone that wanted performance needed to move to SQL server for the server
    model.

    -----Original Message-----
    From: Marvin Humphrey <marvin@rectangular.com>
    Sent: Dec 26, 2008 12:53 PM
    To: java-dev@lucene.apache.org
    Subject: Re: Realtime Search
    On Fri, Dec 26, 2008 at 06:22:23AM -0500, Michael McCandless wrote:
    4) Allow 2 concurrent writers: one for small, fast updates, and one for
    big background merges.
    Marvin, can you describe more detail here?
    The goal is to improve worst-case write performance.

    Currently, writes are quick most of the time, but occasionally you'll trigger
    a big merge and get stuck. To solve this problem, we can assign a merge
    policy to our primary writer which tells it to merge no more than
    mergeThreshold documents. The value of mergeThreshold will need tuning
    depending on document size, change rate, and so on, but the idea is that we
    want this writer to do as much merging as it can while still keeping
    worst-case write performance down to an acceptable number.

    Doing only small merges just puts off the day of reckoning, of course. By
    avoiding big consolidations, we are slowly accumulating small-to-medium sized
    segments and causing a gradual degradation of search-time performance.

    What we'd like is a separate write process, operating (mostly) in the
    background, dedicated solely to merging segments which contain at least
    mergeThreshold docs.

    If all we have to do is add documents to the index, adding that second write
    process isn't a big deal. We have to worry about competition for segment,
    snapshot, and temp file names, but that's about it.

    Deletions make matters more complicated, but with a tombstone-based deletions
    mechanism, the problems are solvable.

    When the background merge writer starts up, it will see a particular view of
    the index in time, including deletions. It will perform nearly all of its
    operations based on this view of the index, mapping around documents which
    were marked as deleted at init time.

    In between the time when the background merge writer starts up and the time it
    finishes consolidating segment data, we assume that the primary writer will
    have modified the index.

    * New docs have been added in new segments.
    * Tombstones have been added which suppress documents in segments which
    didn't even exist when the background merge writer started up.
    * Tombstones have been added which suppress documents in segments which
    existed when the background merge writer started up, but were not merged.
    * Tombstones have been added which suppress documents in segments which have
    just been merged.

    Only the last category of deletions matters.

    At this point, the background merge writer acquires an exclusive write lock on
    the index. It examines recently added tombstones, translates the document
    numbers and writes a tombstone file against itself. Then it writes the
    snapshot file to commit its changes and releases the write lock.

    Worst-case update performance for the system is now the sum of the time it
    takes the background merge writer to consolidate tombstones and the
    worst-case performance of the primary writer.
    It sounds like this is your solution for "decoupling" segment changes due
    to merges from changes from docs being indexed, from a reader's
    standpoint?
    It's true that we are decoupling the process of making logical changes to the
    index from the process of internal consolidation. I probably wouldn't
    describe that as being done from the reader's standpoint, though.

    With mmap and data structures optimized for it, we basically solve the
    read-time responsiveness cost problem. From the client perspective, the delay
    between firing off a change order and seeing that change made live is now
    dominated by the time it takes to actually update the index. The time between
    the commit and having an IndexReader which can see that commit is
    negligible in comparison.
    Since you are using mmap to achieve near zero brand-new IndexReader
    creation, whereas in Lucene we are moving towards achieving real-time
    by always reopening a current IndexReader (not a brand new one), it
    seems like you should not actually have to worry about the case of
    reopening a reader after a large merge has finished?
    Even though we can rely on mmap rather than slurping, there are potentially
    a lot of files to open and a lot of JSON-encoded metadata to parse, so I'm
    not certain that Lucy/KS will never have to worry about the time it takes
    to open a new IndexReader. Fortunately, we can implement reopen() if we
    need to.
    We need to deal with this case (background the warming) because
    creating that new SegmentReader (on the newly merged segment) can take
    a non-trivial amount of time.
    Yes. Without mmap or some other solution, I think improvements to
    worst-case update performance in Lucene will continue to be constrained
    by post-commit IndexReader opening costs.

    Marvin Humphrey
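    In Lucene terms, the primary-writer half of Marvin's split could be
    approximated with a merge-policy cap; the background consolidator and the
    tombstone translation are the hard parts he describes and are only noted
    in comments here (mergeThreshold is his tunable, the rest of the wiring is
    assumed):

      // Primary writer: cap merges so no update stalls behind a big merge.
      LogDocMergePolicy smallMergesOnly = new LogDocMergePolicy();
      smallMergesOnly.setMaxMergeDocs(mergeThreshold); // leave big segments alone
      primaryWriter.setMergePolicy(smallMergesOnly);

      // A second, background writer/process would then: (1) snapshot the index,
      // (2) merge only segments holding >= mergeThreshold docs, (3) briefly take
      // the write lock to translate tombstones added against the merged-away
      // segments, and (4) commit its own snapshot.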


  • Michael McCandless at Jan 9, 2009 at 2:00 pm

    Marvin Humphrey wrote:

    The goal is to improve worst-case write performance.
    ...
    In between the time when the background merge writer starts up and the time it finishes consolidating segment data, we assume that the primary writer will have modified the index.

    * New docs have been added in new segments.
    * Tombstones have been added which suppress documents in segments which didn't even exist when the background merge writer started up.
    * Tombstones have been added which suppress documents in segments which existed when the background merge writer started up, but were not merged.
    * Tombstones have been added which suppress documents in segments which have just been merged.

    Only the last category of deletions matters.

    At this point, the background merge writer acquires an exclusive write lock on the index. It examines recently added tombstones, translates the document numbers and writes a tombstone file against itself. Then it writes the snapshot file to commit its changes and releases the write lock.
    OK, now I understand KS's two-writer model. Lucene has already solved
    this with the ConcurrentMergeScheduler -- all segment merges are done
    in the BG (by default).

    We also have to compute the deletions against the new segment to
    include deletions that happened to the merged segments after the merge
    kicked off.

    Still, it's not a panacea since often the IO system has horrible
    degradation in performance while a merge is running. If only we could
    mark all IO (reads & writes) associated with merging as low priority
    and have the OS actually do the right thing...
    It's true that we are decoupling the process of making logical changes to the index from the process of internal consolidation. I probably wouldn't describe that as being done from the reader's standpoint, though.
    Right, we have a different problem in Lucene (because we must warm a
    reader before using it): after a large merge, warming the new
    IndexReader that includes that segment can be costly (though that cost
    is going down with LUCENE-1483, and eventually column-stride fields).

    But we can solve this by allowing a reopened reader to use the old
    segments, until the new segment is warmed.

    Mike
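    For reference, the Lucene side of this as a sketch -- ConcurrentMergeScheduler
    has been the default scheduler since 2.3, so setting it explicitly is
    redundant and shown only for clarity (dir and analyzer assumed):

      IndexWriter w = new IndexWriter(dir, analyzer,
          IndexWriter.MaxFieldLength.UNLIMITED);
      w.setMergeScheduler(new ConcurrentMergeScheduler()); // merges run in
                                                           // background threads
      // add/delete calls no longer stall for the duration of a large merge;
      // close() (equivalent to close(true)) still waits for running merges.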

