Refactoring of IndexWriter
--------------------------

Key: LUCENE-2026
URL: https://issues.apache.org/jira/browse/LUCENE-2026
Project: Lucene - Java
Issue Type: Improvement
Components: Index
Reporter: Michael Busch
Assignee: Michael Busch
Priority: Minor
Fix For: 3.1


I've been thinking for a while about refactoring the IndexWriter into
two main components.

One could be called a SegmentWriter; as the
name says, its job would be to write one particular index segment. The
default one, just as today, would provide methods to add documents and
flush when its buffer is full.
Other SegmentWriter implementations would do things like appending or
copying external segments [what addIndexes*() currently does].

The second component's job would be to manage writing the segments
file and merging/deleting segments. It would know about
DeletionPolicy, MergePolicy and MergeScheduler. Ideally it would
provide hooks that allow users to manage external data structures and
keep them in sync with Lucene's data during segment merges.

API-wise, there are things we have to figure out, such as where the
updateDocument() method would fit in, because its deletion part
affects all segments, whereas the new document is only being added to
the new segment.

Of course these should be lower level APIs for things like parallel
indexing and related use cases. That's why we should still provide
easy to use APIs like today for people who don't need to care about
per-segment ops during indexing. So the current IndexWriter could
probably keep most of its APIs and delegate to the new classes.
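A rough sketch of the proposed split, in Java. All names here (SegmentWriter, SegmentManager, the flush/register methods) are illustrative assumptions about the design being discussed, not actual Lucene APIs; documents are stand-in strings.

```java
import java.util.ArrayList;
import java.util.List;

// First component: writes one particular index segment. The default
// implementation buffers documents and flushes when the buffer is full.
class SegmentWriter {
    private final int maxBuffered;
    private final List<String> buffer = new ArrayList<>();
    private int segGen = 0;

    SegmentWriter(int maxBuffered) { this.maxBuffered = maxBuffered; }

    /** Buffer a document; flush automatically when the buffer is full.
     *  Returns the new segment's name on flush, else null. */
    String addDocument(String doc) {
        buffer.add(doc);
        return buffer.size() >= maxBuffered ? flush() : null;
    }

    /** Write the buffered docs as a new segment and return its name. */
    String flush() {
        if (buffer.isEmpty()) return null;
        buffer.clear();
        return "_" + (segGen++);
    }
}

// Second component: tracks the set of live segments. In the real design
// this is where DeletionPolicy, MergePolicy, and MergeScheduler would live.
class SegmentManager {
    final List<String> segments = new ArrayList<>();
    void register(String segmentName) { if (segmentName != null) segments.add(segmentName); }
}
```

Other SegmentWriter subclasses (appending or copying external segments, as addIndexes*() does today) would produce segment names the same way and hand them to the manager.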

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org


  • John Wang (JIRA) at Nov 4, 2009 at 1:27 am
    [ https://issues.apache.org/jira/browse/LUCENE-2026?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12773329#action_12773329 ]

    John Wang commented on LUCENE-2026:
    -----------------------------------

    +1
  • Michael McCandless (JIRA) at Nov 4, 2009 at 9:11 am
    [ https://issues.apache.org/jira/browse/LUCENE-2026?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12773429#action_12773429 ]

    Michael McCandless commented on LUCENE-2026:
    --------------------------------------------

    +1! IndexWriter has become immense.

    I think we should also pull out ReaderPool?
  • Michael Busch (JIRA) at Nov 4, 2009 at 9:39 am
    [ https://issues.apache.org/jira/browse/LUCENE-2026?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12773432#action_12773432 ]

    Michael Busch commented on LUCENE-2026:
    ---------------------------------------

    {quote}
    I think we should also pull out ReaderPool?
    {quote}

    +1!
  • Earwin Burrfoot (JIRA) at Dec 10, 2009 at 6:54 pm
    [ https://issues.apache.org/jira/browse/LUCENE-2026?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12788838#action_12788838 ]

    Earwin Burrfoot commented on LUCENE-2026:
    -----------------------------------------

    We need the ability to see a segment write (and probably a deleted-doc-list write) as a discernible atomic operation. Right now it looks like several file writes, and we can't, say, redirect all files belonging to a certain segment to another Directory (well, not in a simple manner). 'Something' should sit between a Directory (or several Directories) and IndexWriter.

    If we could do this, the current NRT search implementation would be largely obsoleted, innit? Just override the default impl of 'something' and send smaller segments to RAM, bigger ones to disk, and copy RAM segments to disk asynchronously if we want to. Then we can use your grandma's IndexReader and IndexWriter, totally decoupled from each other, and have blazing fast addDocument-commit-reopen turnaround.
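A minimal sketch of the 'something' described above: a per-segment router that atomically decides where a whole segment's files land. The class name, the string stand-ins for the Directory targets, and the size threshold are all hypothetical assumptions, not Lucene API.

```java
import java.util.HashMap;
import java.util.Map;

// Sits conceptually between IndexWriter and its Directories: every file of a
// given segment goes to the same place, so segment placement is atomic.
class SegmentRouter {
    private final long ramThresholdBytes;
    final Map<String, String> placement = new HashMap<>();  // segment -> "ram" | "disk"

    SegmentRouter(long ramThresholdBytes) { this.ramThresholdBytes = ramThresholdBytes; }

    /** Route a whole segment: smaller segments to RAM, bigger ones to disk. */
    String place(String segmentName, long estimatedSizeBytes) {
        String target = estimatedSizeBytes <= ramThresholdBytes ? "ram" : "disk";
        placement.put(segmentName, target);
        return target;
    }
}
```

An asynchronous copier could later move "ram" entries to "disk" without the writer or any reader caring.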
  • Earwin Burrfoot (JIRA) at Dec 10, 2009 at 6:54 pm
    [ https://issues.apache.org/jira/browse/LUCENE-2026?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12788840#action_12788840 ]

    Earwin Burrfoot commented on LUCENE-2026:
    -----------------------------------------

    Oh, forgive me if I just said something stupid :)
  • Michael McCandless (JIRA) at Dec 10, 2009 at 7:32 pm
    [ https://issues.apache.org/jira/browse/LUCENE-2026?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12788856#action_12788856 ]

    Michael McCandless commented on LUCENE-2026:
    --------------------------------------------

    I think what you're describing is in fact the approach that LUCENE-1313 is taking; it's doing the switching internally between the main Dir & a private RAM Dir.

    But in my testing so far (LUCENE-2061), it doesn't seem like it'll help performance much. Ie, the OS generally seems to do a fine job putting those segments in RAM, itself. Ie, by maintaining a write cache. The weirdness is: that only holds true if you flush the segments when they are tiny (once per second, every 100 docs, in my test) -- not yet sure why that's the case. I'm going to re-run perf tests on a more mainstream OS (my tests are all OpenSolaris) and see if that strangeness still happens.

    But I think you still need to not do commit() during the reopen.

    I do think refactoring IW so that there is a separate component that keeps track of segments in the index, may simplify NRT, in that you can go to that source for your current "segments file" even if that segments file is uncommitted. In such a world you could do something like IndexReader.open(SegmentState) and it would be able to open (and, reopen) the real-time reader. It's just that it's seeing changes to the SegmentState done by the writer, even if they're not yet committed.
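A toy sketch of that idea: a versioned segment list that the writer mutates and a reader can open (and reopen) from, whether or not those segments are committed. SegmentState, UncommittedReader, and their methods are hypothetical names, not actual Lucene classes.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Shared, versioned view of the current segments -- the "segments file"
// source of truth, even when the changes are not yet committed.
class SegmentState {
    private final List<String> segments = new ArrayList<>();
    private long version = 0;

    synchronized void add(String segment) { segments.add(segment); version++; }
    synchronized long version() { return version; }
    synchronized List<String> snapshot() { return new ArrayList<>(segments); }
}

// Reader opened against a SegmentState snapshot; sees the writer's
// uncommitted changes as of the moment it was (re)opened.
class UncommittedReader {
    final long version;
    final List<String> segments;

    UncommittedReader(SegmentState state) {
        this.version = state.version();
        this.segments = Collections.unmodifiableList(state.snapshot());
    }

    /** Return a new reader only if the writer changed the state since. */
    UncommittedReader reopenIfChanged(SegmentState state) {
        return state.version() == version ? this : new UncommittedReader(state);
    }
}
```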
  • Earwin Burrfoot (JIRA) at Dec 11, 2009 at 7:18 pm
    [ https://issues.apache.org/jira/browse/LUCENE-2026?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12789473#action_12789473 ]

    Earwin Burrfoot commented on LUCENE-2026:
    -----------------------------------------

    If I understand everything right, with current uberfast reopens (thanks per-segment search), the only thing that makes index/commit/reopen cycle slow is the 'sync' call. That sync call on memory-based Directory is noop.

    And no, you really should commit() to be able to see stuff on reopen() :) My god, seeing changes that aren't yet committed - that violates the meaning of 'commit'.

    The original purpose of current NRT code was.. well.. let me remember.. NRT search! :) With per-segment caches and sync lag defeated you get the delay between a doc being indexed and becoming searchable under tens of milliseconds. Is that not NRT enough to introduce tight coupling between classes that have absolutely no other reason to be coupled??
    Lucene 4.0. Simplicity is our candidate! Vote for Simplicity!

    *: Okay, there remains an issue of merges that piggyback on commits, so writing and committing one smallish segment suddenly becomes a time-consuming operation. But that's a completely separate issue. Go fix your merge policies and have a thread that merges asynchronously.
  • Michael McCandless (JIRA) at Dec 11, 2009 at 9:48 pm
    [ https://issues.apache.org/jira/browse/LUCENE-2026?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12789555#action_12789555 ]

    Michael McCandless commented on LUCENE-2026:
    --------------------------------------------

    bq. If I understand everything right, with current uberfast reopens (thanks per-segment search), the only thing that makes index/commit/reopen cycle slow is the 'sync' call.

    I agree, per-segment searching was the most important step towards
    NRT. It's a great step forward...

    But the fsync call is a killer, so avoiding it in the NRT path is
    necessary. It's also very OS/FS dependent.

    bq. That sync call on memory-based Directory is noop.

    Until you need to spillover to disk because your RAM buffer is full?

    Also, if IW.commit() is called, I would expect any changes in RAM
    should be committed to the real dir (stable storage)?

    And, going through RAM first will necessarily be a hit on indexing
    throughput (Jake estimates 10% hit in Zoie's case). Really, our
    current approach goes through RAM as well, in that OS's write cache
    (if the machine has spare RAM) will quickly accept the small index
    files & write them in the BG. It's not clear we can do better than
    the OS here...

    bq. And no, you really should commit() to be able to see stuff on reopen() My god, seeing changes that aren't yet committed - that violates the meaning of 'commit'.

    Uh, this is an API that clearly states that its purpose is to search
    the uncommitted changes. If you really want to be "pure"
    transactional, don't use this API ;)

    bq. The original purpose of current NRT code was.. well.. let me remember.. NRT search! With per-segment caches and sync lag defeated you get the delay between doc being indexed and becoming searchable under tens of milliseconds. Is that not NRT enough to introduce tight coupling between classes that have absolutely no other reason to be coupled?? Lucene 4.0. Simplicity is our candidate! Vote for Simplicity!

    In fact I favor our current approach because of its simplicity.

    Have a look at LUCENE-1313 (adds RAMDir as you're discussing), or,
    Zoie, which also adds the RAMDir and backgrounds resolving deleted
    docs -- they add complexity to Lucene that I don't think is warranted.

    My general feeling at this point is with per-segment searching, and
    fsync avoided, NRT performance is excellent.

    We've explored a number of possible tweaks to improve it --
    writing first to RAMDir (LUCENE-1313), resolving deletes in the
    foreground (LUCENE-2047), using paged BitVector for deletions
    (LUCENE-1526), Zoie (buffering segments in RAM & backgrounds resolving
    deletes), etc., but, based on testing so far, I don't see the
    justification for the added complexity.

    bq. *: Okay, there remains an issue of merges that piggyback on commits, so writing and commiting one smallish segment suddenly becomes a time-consuming operation. But that's a completely separate issue. Go, fix your mergepolicies and have a thread that merges asynchronously.

    This already runs in the BG by default. But warming the reader on the
    merged segment (before lighting it) is important (IW does this today).

  • Earwin Burrfoot (JIRA) at Dec 11, 2009 at 11:18 pm
    [ https://issues.apache.org/jira/browse/LUCENE-2026?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12789604#action_12789604 ]

    Earwin Burrfoot commented on LUCENE-2026:
    -----------------------------------------

    bq. Until you need to spillover to disk because your RAM buffer is full?
    No, the buffer is there only to decouple indexing from writing. It can be spilled over asynchronously without waiting for it to fill up.

    Okay, we agree on a zillion of things, except the simplicity of the current NRT, and the approach to commit().

    Good commit() behaviour consists of two parts:
    1. Everything commit()ed is guaranteed to be on disk.
    2. Until commit() is called, reading threads don't see new/updated records.

    Now we want more speed, and are ready to sacrifice something if needed.
    You decide to sacrifice new-record (in)visibility. No choice but to hack into IW to allow readers to see its hot, fresh innards.

    I say it's better to sacrifice the write guarantee. In the rare case the process/machine crashes, you can reindex the last few minutes' worth of docs. Now you don't have to hack into IW and write specialized readers. Hence, simplicity. You have only one straightforward writer, you have only one straightforward reader (which is nicely immutable and doesn't need any synchronization code).

    In fact you don't even need to sacrifice write guarantee. What was the reason for it? The only one I can come up with is - the thread that does writes and sync() is different from the thread that calls commit(). But, commit() can return a Future.
    So the process goes as:
    - You index docs, nobody sees them, nor deletions.
    - You call commit(), the docs/deletes are written down to memory (NRT case)/disk (non-NRT case). Right after calling commit() every newly reopened Reader is guaranteed to see your docs/deletes.
    - Background thread does write-to-disk+sync (NRT case)/just sync (non-NRT case), and fires up the Future returned from commit(). At this point all data is guaranteed to be written and braced for a crash, RAM cache or not, OS/RAID controller cache or not.

    For back-compat purposes we can use another name for that Future-returning commit(), and the current commit() will just call this new method and wait on the Future returned.
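A toy sketch of that Future-returning commit(): changes become visible to newly opened readers as soon as commitAsync() returns, while durability (a stand-in here for the write+fsync step) completes on a background thread and is signaled through the returned Future. AsyncCommitWriter and commitAsync() are hypothetical names, not actual IndexWriter API.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

class AsyncCommitWriter {
    private final List<String> pending = new ArrayList<>();
    private final List<String> visible = new ArrayList<>();   // what reopened readers see
    private final List<String> durable = new ArrayList<>();   // what has survived "fsync"
    private final ExecutorService syncThread = Executors.newSingleThreadExecutor();

    synchronized void addDocument(String doc) { pending.add(doc); }

    /** Make pending docs visible now; the Future completes once they are durable. */
    synchronized CompletableFuture<Void> commitAsync() {
        List<String> batch = new ArrayList<>(pending);
        pending.clear();
        visible.addAll(batch);                 // visibility: immediate
        return CompletableFuture.runAsync(() -> {
            synchronized (this) { durable.addAll(batch); }  // durability: deferred
        }, syncThread);
    }

    synchronized List<String> visibleDocs() { return new ArrayList<>(visible); }
    synchronized List<String> durableDocs() { return new ArrayList<>(durable); }
    void close() { syncThread.shutdown(); }
}
```

The back-compat wrapper described above would then be just `commitAsync().join()`.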

    Okay, with that I'm probably shutting up on the topic until I can back myself up with code. Sadly, my current employer is happy with update lag in tens of seconds :)
  • Earwin Burrfoot (JIRA) at Dec 11, 2009 at 11:20 pm
    [ https://issues.apache.org/jira/browse/LUCENE-2026?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12789604#action_12789604 ]

    Earwin Burrfoot edited comment on LUCENE-2026 at 12/11/09 11:19 PM:
    --------------------------------------------------------------------

    bq. Until you need to spillover to disk because your RAM buffer is full?
    No, the buffer is there only to decouple indexing from writing. It can be spilled over asynchronously without waiting for it to fill up.

    Okay, we agree on a zillion things, except the simplicity of the current NRT, and the approach to commit().

    Good commit() behaviour consists of two parts:
    1. Everything commit()ed is guaranteed to be on disk.
    2. Until commit() is called, reading threads don't see new/updated records.

    Now we want more speed, and are ready to sacrifice something if needed.
    You decide to sacrifice new-record (in)visibility. There is no choice but to hack into IW to allow readers to see its hot, fresh innards.

    I say it's better to sacrifice the write guarantee. In the rare case the process/machine crashes, you can reindex the last few minutes' worth of docs. Now you don't have to hack into IW and write specialized readers. Hence, simplicity. You have only one straightforward writer and only one straightforward reader (which is nicely immutable and doesn't need any synchronization code).

    In fact you don't even need to sacrifice the write guarantee. What was the reason for it? The only one I can come up with is that the thread that does writes and sync() is different from the thread that calls commit(). But commit() can return a Future.
    So the process goes as:
    - You index docs, nobody sees them, nor deletions.
    - You call commit(), the docs/deletes are written down to memory (NRT case)/disk (non-NRT case). Right after calling commit() every newly reopened Reader is guaranteed to see your docs/deletes.
    - Background thread does write-to-disk+sync(NRT case)/just sync (non-NRT case), and fires up the Future returned from commit(). At this point all data is guaranteed to be written and braced for a crash, ram cache or not, OS/raid controller cache or not.

    For back-compat purposes we can use another name for that Future-returning commit(), and the current commit() will just call this new method and wait on the Future returned.
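    The proposed flow -- visibility at commit() time, durability signalled later via a Future -- can be sketched in plain java.util.concurrent. This is a hypothetical illustration, not a Lucene API; all class and method names here are made up:

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Sketch of a Future-returning commit(): buffered docs stay invisible
// until commit(), which publishes a snapshot to new readers immediately
// and hands durability (the write-to-disk + fsync step) to a background
// thread that completes the returned Future.
class SoftCommitWriter {
    private final List<String> buffer = new ArrayList<>();              // uncommitted docs
    private volatile List<String> published = Collections.emptyList(); // what new readers see
    private final ExecutorService syncThread = Executors.newSingleThreadExecutor();

    synchronized void addDocument(String doc) {
        buffer.add(doc);                  // invisible until commit()
    }

    // Visibility is immediate; durability is deferred to the Future.
    synchronized Future<?> commit() {
        List<String> snapshot = new ArrayList<>(published);
        snapshot.addAll(buffer);
        buffer.clear();
        published = Collections.unmodifiableList(snapshot); // newly reopened readers see it now
        return syncThread.submit(() -> {
            // stand-in for write-to-disk + sync of the snapshot;
            // completing the Future means "braced for a crash"
        });
    }

    List<String> openReader() {           // a "newly reopened reader"
        return published;
    }

    void close() { syncThread.shutdown(); }
}
```

    A back-compat-style blocking commit() would simply call this method and wait on the returned Future.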

    Okay, with that I'm probably shutting up on the topic until I can back myself up with code. Sadly, my current employer is happy with update lag in tens of seconds :)

  • Marvin Humphrey (JIRA) at Dec 11, 2009 at 11:58 pm
    [ https://issues.apache.org/jira/browse/LUCENE-2026?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12789614#action_12789614 ]

    Marvin Humphrey commented on LUCENE-2026:
    -----------------------------------------
    bq. I say it's better to sacrifice write guarantee.
    I don't grok why sync is the default, especially given how sketchy hardware
    drivers are about obeying fsync:

    {panel}
    But, beware: some hardware devices may in fact cache writes even during
    fsync, and return before the bits are actually on stable storage, to give the
    appearance of faster performance.
    {panel}

    IMO, it should have been an option which defaults to false, to be enabled only by
    users who have the expertise to ensure that fsync() is actually doing what
    it advertises. But what's done is done (and Lucy will probably just do something
    different.)

    With regard to Lucene NRT, though, turning sync() off would really help. If and
    when some sort of settings class comes about, an enableSync(boolean enabled)
    method seems like it would come in handy.
  • Jake Mannix (JIRA) at Dec 12, 2009 at 12:04 am
    [ https://issues.apache.org/jira/browse/LUCENE-2026?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12789618#action_12789618 ]

    Jake Mannix commented on LUCENE-2026:
    -------------------------------------

    bq. Now we want more speed, and are ready to sacrifice something if needed.
    bq. You decide to sacrifice new record (in)visibility. No choice, but to hack into IW to allow readers see its hot, fresh innards.

    Chiming in here: of course you don't *need* to hack into the IW to do this (i.e., there is a choice). Zoie is a completely user-land solution which modifies no IW/IR internals and yet achieves millisecond index-to-query-visibility turnaround while keeping indexing and query performance speedy. It just keeps the RAMDir encapsulated in an object (an IndexingSystem) which has IndexReaders built off of both the RAMDir and the FSDir, and hides the implementation details (in fact the IW itself) from the user.

    The API for this kind of thing doesn't *have* to be tightly coupled, and I would agree with you that it shouldn't be.
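    The user-land pattern described above can be modeled in miniature with plain JDK collections standing in for the RAMDir and FSDir indexes. This is a toy sketch with hypothetical names, not Zoie's or Lucene's actual code:

```java
import java.util.HashSet;
import java.util.Set;

// Toy model of the two-tier approach: new docs land in a RAM tier and are
// searchable immediately via a combined view over RAM + "disk"; a periodic
// flush moves the RAM tier into the disk tier. No writer internals are
// touched -- the coupling lives entirely in this wrapper object.
class TwoTierIndex {
    private final Set<String> diskTier = new HashSet<>(); // stands in for the FSDir index
    private final Set<String> ramTier = new HashSet<>();  // stands in for the RAMDir index

    synchronized void addDocument(String doc) {
        ramTier.add(doc);                 // visible to the combined view right away
    }

    synchronized boolean search(String doc) {
        return ramTier.contains(doc) || diskTier.contains(doc);
    }

    synchronized void flush() {           // RAM segment spills to disk
        diskTier.addAll(ramTier);
        ramTier.clear();
    }
}
```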
  • Michael McCandless (JIRA) at Dec 12, 2009 at 10:56 am
    [ https://issues.apache.org/jira/browse/LUCENE-2026?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12789708#action_12789708 ]

    Michael McCandless commented on LUCENE-2026:
    --------------------------------------------

    {quote}
    bq. Until you need to spillover to disk because your RAM buffer is full?

    No, buffer is there only to decouple indexing from writing. Can be spilt over asynchronously without waiting for it to be filled up.
    {quote}

    But this is where things start to get complex... the devil is in the
    details here. How do you carry over your deletes? This spillover
    will take time -- do you block all indexing while that's happening
    (not great)? Do you do it gradually (start spillover when half full,
    but still accept indexing)? Do you throttle things if index rate
    exceeds flush rate? How do you recover on exception?

    NRT today lets the OS's write cache decide how to use RAM to speed up
    writing of these small files, which keeps things a lot simpler for us.
    I don't see why we should add complexity to Lucene to replicate what
    the OS is doing for us. (NOTE: I don't really trust the OS in the
    reverse case... I do think Lucene should read into RAM the data
    structures that are important.)

    bq. You decide to sacrifice new record (in)visibility. No choice, but to hack into IW to allow readers see its hot, fresh innards.

    bq. Now you don't have to hack into IW and write specialized readers.

    Probably we'll just have to disagree here... NRT isn't a hack ;)

    IW is already hanging onto completely normal segments. Ie, the index
    has been updated with these segments, just not yet published so
    outside readers can see it. All NRT does is let a reader see this
    private view.

    The readers that an NRT reader exposes are normal SegmentReaders --
    it's just that rather than consulting a segments_N on disk to get the
    segment metadata, they pull it from IW's uncommitted in-memory
    SegmentInfos instance.

    Yes we've talked about the "hot innards" solution -- an IndexReader
    impl that can directly search DW's ram buffer -- but that doesn't look
    necessary today, because performance of NRT is good with the simple
    solution we have now.

    NRT reader also gains performance by carrying over deletes in RAM. We
    should eventually do the same thing with norms & field cache. No
    reason to write to disk, then right away read again.

    {quote}
    * You index docs, nobody sees them, nor deletions.
    * You call commit(), the docs/deletes are written down to memory (NRT case)/disk (non-NRT case). Right after calling commit() every newly reopened Reader is guaranteed to see your docs/deletes.
    * Background thread does write-to-disk+sync(NRT case)/just sync (non-NRT case), and fires up the Future returned from commit(). At this point all data is guaranteed to be written and braced for a crash, ram cache or not, OS/raid controller cache or not.
    {quote}

    But this is not a commit, if docs/deletes are written down into RAM?
    Ie, commit could return, then the machine could crash, and you've lost
    changes? Commit should go through to stable storage before returning?
    Maybe I'm just missing the big picture of what you're proposing
    here...

    Also, you can build all this out on top of Lucene today? Zoie is a
    proof point of this. (Actually: how does your proposal differ from
    Zoie? Maybe that'd help shed light...).

    bq. I say it's better to sacrifice write guarantee. In the rare case the process/machine crashes, you can reindex last few minutes' worth of docs.

    It is not that simple -- if you skip the fsync, and OS crashes/you
    lose power, your index can easily become corrupt. The resulting
    CheckIndex -fix can easily need to remove large segments.

    The OS's write cache makes no guarantees about the order in which the
    files you've written find their way to disk.

    Another option (we've discussed this) would be journal file approach
    (ie transaction log, like most DBs use). You only have one file to
    fsync, and you replay to recover. But that'd be a big change for
    Lucene, would add complexity, and can be accomplished outside of
    Lucene if an app really wants to...
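    The journal-file idea mentioned above can be sketched in a few lines of plain Java (hypothetical names; not a Lucene API): every update is appended to a single log, so only that one file would need fsync, and after a crash the state is rebuilt by replaying the log from the start:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy transaction log: appends are the only write path, and recovery is
// a deterministic replay of the log. In a real system each append would
// be written to the log file and that single file fsync'd.
class JournalLog {
    private final List<String[]> journal = new ArrayList<>(); // [op, key, value]

    void append(String op, String key, String value) {
        journal.add(new String[] {op, key, value});
    }

    // Replay the journal to reconstruct the index state after a crash.
    Map<String, String> replay() {
        Map<String, String> index = new HashMap<>();
        for (String[] rec : journal) {
            if (rec[0].equals("add")) {
                index.put(rec[1], rec[2]);
            } else if (rec[0].equals("delete")) {
                index.remove(rec[1]);
            }
        }
        return index;
    }
}
```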

    Let me try turning this around: in your componentization of
    SegmentReader, why does it matter who's tracking which components are
    needed to make up a given SR? In the IndexReader.open case, it's a
    SegmentInfos instance (obtained by loading the segments_N file from disk).
    In the NRT case, it's also a SegmentInfos instance (the one IW is
    privately keeping track of and only publishing on commit). At the
    component level, creating the SegmentReader should be no different?

  • Michael McCandless (JIRA) at Dec 12, 2009 at 11:23 am
    [ https://issues.apache.org/jira/browse/LUCENE-2026?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12789714#action_12789714 ]

    Michael McCandless commented on LUCENE-2026:
    --------------------------------------------

    {quote}
    I say it's better to sacrifice write guarantee.
    I don't grok why sync is the default, especially given how sketchy hardware
    drivers are about obeying fsync:

    {panel}
    But, beware: some hardware devices may in fact cache writes even during
    fsync, and return before the bits are actually on stable storage, to give the
    appearance of faster performance.
    {panel}
    {quote}

    It's unclear how often this scare-warning is true in practice (scare
    warnings tend to spread very easily without concrete data); it's in
    the javadocs for completeness' sake. I expect (though have no data to
    back this up...) that most OS/IO systems "out there" do properly
    implement fsync.

    {quote}
    IMO, it should have been an option which defaults to false, to be enabled only by
    users who have the expertise to ensure that fsync() is actually doing what
    it advertises. But what's done is done (and Lucy will probably just do something
    different.)
    {quote}

    I think that's a poor default (trades safety for performance), unless
    Lucy eg uses a transaction log so you can concretely bound what's lost
    on crash/power loss. Or, if you go back to autocommitting I guess...

    If we did this in Lucene, you can have unbounded corruption. It's not
    just the last few minutes of updates...

    So, I don't think we should even offer the option to turn it off. You
    can easily subclass your FSDir impl and make sync() a no-op if you
    really want to...
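    The shape of that subclass-and-no-op escape hatch looks roughly like the following. To keep this self-contained, Directory here is a hypothetical stand-in interface rather than Lucene's actual class; only the override pattern is the point:

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of disabling fsync by decorating/overriding sync(): everything
// forwards to the delegate except sync(), which is intentionally a no-op,
// trading crash safety for faster commits.
class NoSyncDirectory {
    interface Directory {
        void writeFile(String name);
        void sync(String name);     // fsync the named file to stable storage
    }

    // Test double that records which files were actually synced.
    static class TrackingDirectory implements Directory {
        final List<String> synced = new ArrayList<>();
        public void writeFile(String name) { }
        public void sync(String name) { synced.add(name); }
    }

    static class NoSync implements Directory {
        private final Directory delegate;
        NoSync(Directory delegate) { this.delegate = delegate; }
        public void writeFile(String name) { delegate.writeFile(name); }
        public void sync(String name) { /* intentionally skipped */ }
    }
}
```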

    {quote}
    With regard to Lucene NRT, though, turning sync() off would really help. If and
    when some sort of settings class comes about, an enableSync(boolean enabled)
    method seems like it would come in handy.
    {quote}

    You don't need to turn off sync for NRT -- that's the whole point. It
    gives you a reader without syncing the files. Really, this is your
    safety tradeoff -- it means you can commit less frequently, since the
    NRT reader can search the latest updates. But your app has
    complete control over how it wants to trade safety for performance.

  • Michael McCandless (JIRA) at Dec 12, 2009 at 11:31 am
    [ https://issues.apache.org/jira/browse/LUCENE-2026?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12789716#action_12789716 ]

    Michael McCandless commented on LUCENE-2026:
    --------------------------------------------

    bq. Zoie is a completely user-land solution which modifies no IW/IR internals and yet achieves millisecond index-to-query-visibility turnaround while keeping speedy indexing and query performance. It just keeps the RAMDir outside encapsulated in an object (an IndexingSystem) which has IndexReaders built off of both the RAMDir and the FSDir, and hides the implementation details (in fact the IW itself) from the user.

    Right, one can always not use NRT and build their own layers on top.

    But Zoie has *a lot* of code to accomplish this -- the devil really is
    in the details of "simply write first to a RAMDir". This is why I'd
    like Earwin to look @ Zoie and clarify his proposed approach, in
    contrast...

    Actually, here's a question: how quickly can Zoie turn around a
    commit()? Seems like it must take more time than Lucene, since it does
    extra stuff (flush RAM buffers to disk, materialize deletes) before
    even calling IW.commit.

    At the end of the day, any NRT system has to trade safety for
    performance (bypass the sync call in the NRT reader)....

    bq. The API for this kind of thing doesn't have to be tightly coupled, and I would agree with you that it shouldn't be.

    I don't consider NRT today to be a tight coupling (eg, the pending
    refactoring of IW would nicely separate it out). If we implement the
    IR that searches DW's RAM buffer, then I'd agree ;)

  • Marvin Humphrey (JIRA) at Dec 13, 2009 at 3:39 pm
    [ https://issues.apache.org/jira/browse/LUCENE-2026?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12789905#action_12789905 ]

    Marvin Humphrey commented on LUCENE-2026:
    -----------------------------------------
    {quote}
    I think that's a poor default (trades safety for performance), unless
    Lucy eg uses a transaction log so you can concretely bound what's lost
    on crash/power loss. Or, if you go back to autocommitting I guess...
    {quote}
    Search indexes should not be used for canonical data storage -- they should be
    built *on top of* canonical data storage. Guarding against power failure
    induced corruption in a database is an imperative. Guarding against power
    failure induced corruption in a search index is a feature, not an imperative.

    Users have many options for dealing with the potential for such corruption.
    You can go back to your canonical data store and rebuild your index from
    scratch when it happens. In a search cluster environment, you can rsync a
    known-good copy from another node. Potentially, you might enable
    fsync-before-commit and keep your own transaction log. However, if the time
    it takes to rebuild or recover an index from scratch would have caused you
    unacceptable downtime, you can't possibly be operating in a
    single-point-of-failure environment where a power failure could take you down
    anyway -- so other recovery options are available to you.

    Turning on fsync is only one step towards ensuring index integrity; others
    steps involve making decisions about hard drives, RAID arrays, failover
    strategies, network and off-site backups, etc, and are outside of our domain
    as library authors. We cannot meet the needs of users who need guaranteed
    index integrity on our own.

    For everybody else, what turning on fsync by default achieves is to make an
    exceedingly rare event rarer. That's valuable, but not essential. My
    argument is that since the search indexes should not be used for canonical
    storage, and since fsync is not testably reliable and not sufficient on its
    own, it's a good engineering compromise to prioritize performance.
    {quote}
    If we did this in Lucene, you can have unbounded corruption. It's not
    just the last few minutes of updates...
    {quote}
    Wasn't that a possibility under autocommit as well? All it takes is for the
    OS to finish flushing the new snapshot file to persistent storage before it
    finishes flushing a segment data file needed by that snapshot, and for the
    power failure to squeeze in between.

    In practice, locality of reference is going to make the window very very
    small, since those two pieces of data will usually get written very close to
    each other on the persistent media.

    I've seen a lot more messages to our user lists over the years about data
    corruption caused by bugs and misconfigurations than by power failures.

    But really, that's as it should be. Ensuring data integrity to the degree
    required by a database is costly -- it requires far more rigorous testing, and
    far more conservative development practices. If we accept that our indexes
    must *never* go corrupt, it will retard innovation.

    Of course we should work very hard to prevent index corruption. However, I'm
    much more concerned about stuff like silent omission of search results due to
    overzealous, overly complex optimizations than I am about problems arising
    from power failures. When a power failure occurs, you know it -- so you get
    the opportunity to fsck the disk, run checkIndex(), perform data integrity
    reconciliation tests against canonical storage, and if anything fails, take
    whatever recovery actions you deem necessary.
    You don't need to turn off sync for NRT - that's the whole point. It
    gives you a reader without syncing the files.
    I suppose this is where Lucy and Lucene differ. Thanks to mmap and the
    near-instantaneous reader opens it has enabled, we don't need to keep a
    special reader alive. Since there's no special reader, the only way to get
    data to a search process is to go through a commit. But if we fsync on every
    commit, we'll drag down indexing responsiveness. Finishing the commit and
    returning control to client code as quickly as possible is a high priority for
    us.

    Furthermore, I don't want us to have to write the code to support a
    near-real-time reader hanging off of IndexWriter a la Lucene. The
    architectural discussions have made for very interesting reading, but the
    design seems to be tricky to pull off, and implementation simplicity in core
    search code is a high priority for Lucy. It's better for Lucy to kill two
    birds with one stone and concentrate on making *all* index opens fast.
    Really, this is your safety tradeoff - it means you can commit less
    frequently, since the NRT reader can search the latest updates. But, your
    app has complete control over how it wants to trade safety for
    performance.
    So long as fsync is an option, the app always has complete control, regardless
    of whether the default setting is fsync or no fsync.

    If a Lucene app wanted to increase NRT responsiveness and throughput, and if
    absolute index integrity wasn't a concern because it had been addressed
    through other means (e.g. multi-node search cluster), would turning off fsync
    speed things up under any of the proposed designs?
  • Jason Rutherglen (JIRA) at Dec 14, 2009 at 2:52 am
    [ https://issues.apache.org/jira/browse/LUCENE-2026?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12789971#action_12789971 ]

    Jason Rutherglen commented on LUCENE-2026:
    ------------------------------------------

    I think large scale NRT installations may eventually require a
    distributed transaction log. The implementation details have yet
    to be determined; however, it could potentially solve the issue of
    data loss being discussed. One candidate is a combination of ZooKeeper
    + BookKeeper. I would venture to guess this could be implemented
    as a part of Solr; however, we've got a lot of work to do for
    Solr to be reasonably NRT efficient (see the tracking issue
    SOLR-1606), and we're just starting on the ZooKeeper
    implementation, SOLR-1277...
  • Michael McCandless (JIRA) at Dec 15, 2009 at 10:21 pm
    [ https://issues.apache.org/jira/browse/LUCENE-2026?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12790988#action_12790988 ]

    Michael McCandless commented on LUCENE-2026:
    --------------------------------------------

    {quote}
    bq. I think that's a poor default (trades safety for performance), unless Lucy eg uses a transaction log so you can concretely bound what's lost on crash/power loss. Or, if you go back to autocommitting I guess...

    Search indexes should not be used for canonical data storage - they should be
    built on top of canonical data storage.
    {quote}

    I agree with that, in theory, but I think in practice it's too
    idealistic to force/expect apps to meet that ideal.

    I expect for many apps it's a major cost to unexpectedly lose the
    search index on power loss / OS crash.

    {quote}
    Users have many options for dealing with the potential for such corruption.
    You can go back to your canonical data store and rebuild your index from
    scratch when it happens. In a search cluster environment, you can rsync a
    known-good copy from another node. Potentially, you might enable
    fsync-before-commit and keep your own transaction log. However, if the time
    it takes to rebuild or recover an index from scratch would have caused you
    unacceptable downtime, you can't possibly be operating in a
    single-point-of-failure environment where a power failure could take you down
    anyway - so other recovery options are available to you.

    Turning on fsync is only one step towards ensuring index integrity; other
    steps involve making decisions about hard drives, RAID arrays, failover
    strategies, network and off-site backups, etc, and are outside of our domain
    as library authors. We cannot meet the needs of users who need guaranteed
    index integrity on our own.
    {quote}

    Yes, high availability apps will already take their measures to
    protect the search index / recovery process, going beyond fsync.
    E.g., making a hot backup of a Lucene index is now straightforward.

    {quote}
    For everybody else, what turning on fsync by default achieves is to make an
    exceedingly rare event rarer. That's valuable, but not essential. My
    argument is that since the search indexes should not be used for canonical
    storage, and since fsync is not testably reliable and not sufficient on its
    own, it's a good engineering compromise to prioritize performance.
    {quote}

    Losing power to the machine, or OS crash, or the user doing a hard
    power down because OS isn't responding, I think are not actually
    *that* uncommon in an end user setting. Think of a desktop app
    embedding Lucene/Lucy...

    {quote}
    bq. If we did this in Lucene, you can have unbounded corruption. It's not just the last few minutes of updates...

    Wasn't that a possibility under autocommit as well? All it takes is for the
    OS to finish flushing the new snapshot file to persistent storage before it
    finishes flushing a segment data file needed by that snapshot, and for the
    power failure to squeeze in between.
    {quote}

    Not after LUCENE-1044... autoCommit simply called commit() at certain
    opportune times (after finishing big merges), which does the right thing
    (I hope!). The segments file is not written until all files it
    references are sync'd.
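The ordering described here can be sketched with plain java.io; this is only an illustration of the protocol, not Lucene's actual Directory code, and the file names are made up:

```java
import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;

// Illustration of the commit ordering: data files are fsync'd first,
// and only then is the segments file that references them written and
// fsync'd. A crash at any point leaves the previous commit point
// intact, because no segments file ever points at unsynced data.
public class CommitSketch {
    static void writeAndSync(File f, String data) throws IOException {
        try (FileOutputStream out = new FileOutputStream(f)) {
            out.write(data.getBytes(StandardCharsets.UTF_8));
            out.getFD().sync(); // force bytes to persistent storage
        }
    }

    public static void commit(File dir) throws IOException {
        // 1. Write and fsync every data file the new commit references.
        writeAndSync(new File(dir, "seg_0.dat"), "segment data");
        // 2. Only now write (and fsync) the segments file pointing at them.
        writeAndSync(new File(dir, "segments_1"), "seg_0.dat");
    }
}
```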

    {quote}
    In practice, locality of reference is going to make the window very very
    small, since those two pieces of data will usually get written very close to
    each other on the persistent media.
    {quote}

    Not sure about that -- it depends on how effectively the OS's write cache
    "preserves" that locality.

    {quote}
    I've seen a lot more messages to our user lists over the years about data
    corruption caused by bugs and misconfigurations than by power failures.
    {quote}

    I would agree, though, I think it may be a sampling problem... ie
    people whose machines crashed and they lost the search index would
    often not raise it on the list (vs say a persistent config issue that keeps
    leading to corruption).

    {quote}
    But really, that's as it should be. Ensuring data integrity to the degree
    required by a database is costly - it requires far more rigorous testing, and
    far more conservative development practices. If we accept that our indexes
    must never go corrupt, it will retard innovation.
    {quote}

    It's not really that costly, with NRT -- you can get a searcher on the
    index without paying the commit cost. And now you can call commit
    however frequently you need to. Quickly turning around a new
    searcher, and how frequently you commit, are now independent.

    Also, having the app explicitly decouple these two notions keeps the
    door open for future improvements. If we force absolutely all sharing
    to go through the filesystem then that limits the improvements we can
    make to NRT.

    {quote}
    Of course we should work very hard to prevent index corruption. However, I'm
    much more concerned about stuff like silent omission of search results due to
    overzealous, overly complex optimizations than I am about problems arising
    from power failures. When a power failure occurs, you know it - so you get
    the opportunity to fsck the disk, run checkIndex(), perform data integrity
    reconciliation tests against canonical storage, and if anything fails, take
    whatever recovery actions you deem necessary.
    {quote}

    Well... I think search performance is important, and we should pursue it
    even if we risk bugs.

    {quote}
    bq. You don't need to turn off sync for NRT - that's the whole point. It gives you a reader without syncing the files.

    I suppose this is where Lucy and Lucene differ. Thanks to mmap and the
    near-instantaneous reader opens it has enabled, we don't need to keep a
    special reader alive. Since there's no special reader, the only way to get
    data to a search process is to go through a commit. But if we fsync on every
    commit, we'll drag down indexing responsiveness. Finishing the commit and
    returning control to client code as quickly as possible is a high priority for
    us.
    {quote}

    NRT reader isn't that special -- the only differences are: 1) it
    loads the segments_N "file" from IW instead of the filesystem, and 2)
    it uses a reader pool to "share" the underlying SegmentReaders with
    other places that have loaded them. I guess, if Lucy won't allow
    this, then, yes, forcing a commit in order to reopen is very costly,
    and so sacrificing safety is a tradeoff you have to make.

    Alternatively, you could keep the notion of a "flush" (an unsafe commit)
    alive? You write the segments file, but make no effort to ensure its
    durability (and also preserve the last "true" commit). Then a normal
    IR.reopen suffices...

    {quote}
    Furthermore, I don't want us to have to write the code to support a
    near-real-time reader hanging off of IndexWriter a la Lucene. The
    architectural discussions have made for very interesting reading, but the
    design seems to be tricky to pull off, and implementation simplicity in core
    search code is a high priority for Lucy. It's better for Lucy to kill two
    birds with one stone and concentrate on making all index opens fast.
    {quote}

    But shouldn't you at least give an option for index durability? Even
    if we disagree about the default?

    {quote}
    bq. Really, this is your safety tradeoff - it means you can commit less frequently, since the NRT reader can search the latest updates. But, your app has complete control over how it wants to trade safety for performance.

    So long as fsync is an option, the app always has complete control,
    regardless of whether the default setting is fsync or no fsync.
    {quote}

    Well it is an "option" in Lucene -- "it's just software" ;) I don't
    want to make it easy to be unsafe. Lucene shouldn't sacrifice safety
    of the index... and with NRT there's no need to make that tradeoff.

    {quote}
    If a Lucene app wanted to increase NRT responsiveness and throughput, and if
    absolute index integrity wasn't a concern because it had been addressed
    through other means (e.g. multi-node search cluster), would turning off fsync
    speed things up under any of the proposed designs?
    {quote}

    Yes, turning off fsync would speed things up -- you could fall back to
    simple reopen and get good performance (NRT should still be faster
    since the readers are pooled). The "use RAMDir on top of Lucene"
    designs would be helped less since fsync is a noop in RAMDir.

  • Marvin Humphrey (JIRA) at Dec 16, 2009 at 7:48 pm
    [ https://issues.apache.org/jira/browse/LUCENE-2026?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12791549#action_12791549 ]

    Marvin Humphrey commented on LUCENE-2026:
    -----------------------------------------
    Wasn't that a possibility under autocommit as well? All it takes is for the
    OS to finish flushing the new snapshot file to persistent storage before it
    finishes flushing a segment data file needed by that snapshot, and for the
    power failure to squeeze in between.
    Not after LUCENE-1044... autoCommit simply called commit() at certain
    opportune times (after finishing big merges), which does the right thing (I
    hope!). The segments file is not written until all files it references are
    sync'd.
    FWIW, autoCommit doesn't really have a place in Lucy's
    one-segment-per-indexing-session model.

    Revisiting the LUCENE-1044 threads, one passage stood out:

    {panel}
    http://www.gossamer-threads.com/lists/lucene/java-dev/54321#54321

    This is why in a db system, the only file that is sync'd is the log
    file - all other files can be made "in sync" from the log file - and
    this file is normally striped for optimum write performance. Some
    systems have special "log file drives" (some even solid state, or
    battery backed ram) to aid the performance.
    {panel}

    The fact that we have to sync all files instead of just one seems sub-optimal.

    Yet Lucene is not well set up to maintain a transaction log. The very act of
    adding a document to Lucene is inherently lossy even if all fields are stored,
    because doc boost is not preserved.
    Also, having the app explicitly decouple these two notions keeps the
    door open for future improvements. If we force absolutely all sharing
    to go through the filesystem then that limits the improvements we can
    make to NRT.
    However, Lucy has much more to gain going through the file system than Lucene
    does, because we don't necessarily incur JVM startup costs when launching a
    new process. The Lucene approach to NRT -- specialized reader hanging off of
    writer -- is constrained to a single process. The Lucy approach -- fast index
    opens enabled by mmap-friendly index formats -- is not.

    The two approaches aren't mutually exclusive. It will be possible to augment
    Lucy with a specialized index reader within a single process. However, A)
    there seems to be a lot of disagreement about just how to integrate that
    reader, and B) there seem to be ways to bolt that functionality on top of the
    existing classes. Under those circumstances, I think it makes more sense to
    keep that feature external for now.
    Alternatively, you could keep the notion of a "flush" (an unsafe commit)
    alive? You write the segments file, but make no effort to ensure its
    durability (and also preserve the last "true" commit). Then a normal
    IR.reopen suffices...
    That sounds promising. The semantics would differ from those of Lucene's
    flush(), which doesn't make changes visible.

    We could implement this by somehow marking a "committed" snapshot and a
    "flushed" snapshot differently, either by adding an "fsync" property to the
    snapshot file that would be false after a flush() but true after a commit(),
    or by encoding the property within the snapshot filename. The file purger
    would have to ensure that all index files referenced by either the last
    committed snapshot or the last flushed snapshot were off limits. A rollback()
    would zap all changes since the last commit().

    Such a scheme allows the top-level app to avoid the costs of fsync while
    maintaining its own transaction log -- perhaps with the optimizations
    suggested above (separate disk, SSD, etc).
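A toy in-memory model of that flush/commit snapshot scheme might look like the following; all the class and method names here are hypothetical (neither Lucy nor Lucene has these), and it only models the bookkeeping, not the actual file I/O:

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashSet;
import java.util.Set;

// Toy model of the scheme sketched above: a "flushed" snapshot is
// visible to readers but not durable; a "committed" snapshot has been
// fsync'd. The file purger must treat files referenced by either the
// last committed or the last flushed snapshot as off limits, and
// rollback() zaps all changes since the last commit().
class SnapshotModel {
    static class Snapshot {
        final boolean committed;
        final Set<String> files;
        Snapshot(boolean committed, Set<String> files) {
            this.committed = committed;
            this.files = files;
        }
    }

    // Most recent snapshot first.
    private final Deque<Snapshot> snapshots = new ArrayDeque<>();

    void flush(Set<String> files)  { snapshots.push(new Snapshot(false, files)); }
    void commit(Set<String> files) { snapshots.push(new Snapshot(true, files)); }

    void rollback() {
        while (!snapshots.isEmpty() && !snapshots.peek().committed) {
            snapshots.pop();
        }
    }

    // Union of the newest committed and newest flushed snapshots' files.
    Set<String> protectedFiles() {
        Set<String> keep = new HashSet<>();
        boolean sawCommitted = false, sawFlushed = false;
        for (Snapshot s : snapshots) {
            if (s.committed && !sawCommitted) { keep.addAll(s.files); sawCommitted = true; }
            if (!s.committed && !sawFlushed)  { keep.addAll(s.files); sawFlushed = true; }
        }
        return keep;
    }
}
```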
  • Michael McCandless (JIRA) at Dec 17, 2009 at 2:07 pm
    [ https://issues.apache.org/jira/browse/LUCENE-2026?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12791936#action_12791936 ]

    Michael McCandless commented on LUCENE-2026:
    --------------------------------------------

    {quote}
    FWIW, autoCommit doesn't really have a place in Lucy's
    one-segment-per-indexing-session model.
    {quote}

    Well, autoCommit just means "periodically call commit". So, if you
    decide to offer a commit() operation, then autoCommit would just wrap
    that? But, I don't think autoCommit should be offered... app should
    decide.
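A sketch of autoCommit as a thin wrapper over an explicit commit(): the Committer interface is invented for illustration (a real wrapper would delegate to IndexWriter.commit()), and it commits every N updates to keep the example deterministic, though a timer would work the same way.

```java
// Hypothetical wrapper showing how autoCommit reduces to "periodically
// call commit". The app, not the index writer, decides the policy.
interface Committer {
    void commit();
}

class AutoCommitter {
    private final Committer inner;
    private final int every;   // commit after this many updates
    private int pending = 0;
    int commits = 0;           // exposed for inspection

    AutoCommitter(Committer inner, int every) {
        this.inner = inner;
        this.every = every;
    }

    void onUpdate() {
        if (++pending >= every) {
            inner.commit();
            commits++;
            pending = 0;
        }
    }
}
```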

    {quote}
    Revisiting the LUCENE-1044 threads, one passage stood out:

    http://www.gossamer-threads.com/lists/lucene/java-dev/54321#54321

    This is why in a db system, the only file that is sync'd is the log
    file - all other files can be made "in sync" from the log file - and
    this file is normally striped for optimum write performance. Some
    systems have special "log file drives" (some even solid state, or
    battery backed ram) to aid the performance.

    The fact that we have to sync all files instead of just one seems sub-optimal.
    {quote}

    Yes, but, that cost is not on the reopen path, so it's much less
    important. Ie, the app can freely choose how frequently it wants to
    commit, completely independent from how often it needs to reopen.

    {quote}
    Yet Lucene is not well set up to maintain a transaction log. The very act of
    adding a document to Lucene is inherently lossy even if all fields are stored,
    because doc boost is not preserved.
    {quote}

    I don't see that those two statements are related.

    One can "easily" (meaning, it's easily decoupled from core) make a
    transaction log on top of lucene -- just serialize your docs/analzyer
    selection/etc to the log & sync it periodically.
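As a rough illustration of such an app-side transaction log (the class name and record format are invented; a real log would serialize full documents and analyzer choices, not single lines of text):

```java
import java.io.BufferedWriter;
import java.io.Closeable;
import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.nio.charset.StandardCharsets;

// Minimal append-only log kept outside the index: append each document
// before handing it to the writer, fsync periodically, and replay the
// tail after a crash to re-index anything the index may have lost.
class DocLog implements Closeable {
    private final FileOutputStream out;
    private final BufferedWriter writer;

    DocLog(File path) throws IOException {
        out = new FileOutputStream(path, /*append=*/ true);
        writer = new BufferedWriter(new OutputStreamWriter(out, StandardCharsets.UTF_8));
    }

    void append(String serializedDoc) throws IOException {
        writer.write(serializedDoc);
        writer.newLine();
    }

    // Called periodically (or per commit): flush buffers and force to disk.
    void sync() throws IOException {
        writer.flush();
        out.getFD().sync();
    }

    @Override public void close() throws IOException {
        sync();
        writer.close();
    }
}
```

Since only this one file is sync'd on the hot path, it can live on a dedicated fast device, as the quoted db-system passage suggests.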

    But, that's orthogonal to what Lucene does & doesn't preserve in its
    index (and, yes, Lucene doesn't precisely preserve boosts).

    {quote}
    bq. Also, having the app explicitly decouple these two notions keeps the door open for future improvements. If we force absolutely all sharing to go through the filesystem then that limits the improvements we can make to NRT.

    However, Lucy has much more to gain going through the file system than Lucene
    does, because we don't necessarily incur JVM startup costs when launching a
    new process. The Lucene approach to NRT - specialized reader hanging off of
    writer - is constrained to a single process. The Lucy approach - fast index
    opens enabled by mmap-friendly index formats - is not.

    The two approaches aren't mutually exclusive. It will be possible to augment
    Lucy with a specialized index reader within a single process. However, A)
    there seems to be a lot of disagreement about just how to integrate that
    reader, and B) there seem to be ways to bolt that functionality on top of the
    existing classes. Under those circumstances, I think it makes more sense to
    keep that feature external for now.
    {quote}

    Again: NRT is not a "specialized reader". It's a normal read-only
    DirectoryReader, just like you'd get from IndexReader.open, with the
    only difference being that it consulted IW to find which segments to
    open. Plus, it's pooled, so that if IW already has a given segment
    reader open (say because deletes were applied or merges are running),
    it's reused.

    We've discussed making it specialized (e.g. directly searching DW's RAM
    buffer, caching recently flushed segments in RAM, special
    incremental-copy-on-write data structures for deleted docs, etc.) but
    so far these changes don't seem worthwhile.

    The current approach to NRT is simple... I haven't yet seen
    performance gains strong enough to justify moving to "specialized
    readers".

    Yes, Lucene's approach must be in the same JVM. But we get important
    gains from this -- reusing a single reader (the pool), carrying over
    merged deletions directly in RAM (and eventually field cache & norms
    too -- LUCENE-1785).

    Instead, Lucy (by design) must do all sharing & access all index data
    through the filesystem (a decision that, I think, could be dangerous),
    which will necessarily increase your reopen time. Maybe in practice
    that cost is small though... the OS write cache should keep everything
    fresh... but you still must serialize.

    {quote}
    bq. Alternatively, you could keep the notion of a "flush" (an unsafe commit) alive? You write the segments file, but make no effort to ensure its durability (and also preserve the last "true" commit). Then a normal IR.reopen suffices...

    That sounds promising. The semantics would differ from those of Lucene's
    flush(), which doesn't make changes visible.

    We could implement this by somehow marking a "committed" snapshot and a
    "flushed" snapshot differently, either by adding an "fsync" property to the
    snapshot file that would be false after a flush() but true after a commit(),
    or by encoding the property within the snapshot filename. The file purger
    would have to ensure that all index files referenced by either the last
    committed snapshot or the last flushed snapshot were off limits. A rollback()
    would zap all changes since the last commit().

    Such a scheme allows the top-level app to avoid the costs of fsync while
    maintaining its own transaction log - perhaps with the optimizations
    suggested above (separate disk, SSD, etc).
    {quote}

    In fact, this would make Lucy's approach to NRT nearly identical to
    Lucene NRT.

    The only difference is, instead of getting the current uncommitted
    segments_N via RAM, Lucy uses the filesystem. And, of course
    Lucy doesn't pool readers. So this is really a Lucy-ification of
    Lucene's approach to NRT.

    So it has the same benefits as Lucene's NRT, ie, lets Lucy apps
    decouple decisions about safety (commit) and freshness (reopen
    turnaround time).

  • Marvin Humphrey (JIRA) at Dec 18, 2009 at 8:21 pm
    [ https://issues.apache.org/jira/browse/LUCENE-2026?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12792625#action_12792625 ]

    Marvin Humphrey commented on LUCENE-2026:
    -----------------------------------------
    Well, autoCommit just means "periodically call commit". So, if you
    decide to offer a commit() operation, then autoCommit would just wrap
    that? But, I don't think autoCommit should be offered... app should
    decide.
    Agreed, autoCommit had benefits under legacy Lucene, but wouldn't be important
    now. If we did add some sort of "automatic commit" feature, it would mean
    something else: commit every change instantly. But that's easy to implement
via a wrapper, so there's no point cluttering the primary index writer
    class to support such a feature.
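That wrapper idea can be sketched as follows. SimpleWriter and the class
names here are hypothetical stand-ins for illustration, not Lucene or Lucy
APIs:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical minimal writer interface; names are illustrative only.
interface SimpleWriter {
    void addDocument(String doc);
    void commit();
}

// A wrapper that commits after every change, showing why an "automatic
// commit" mode needn't live in the core writer class.
class AutoCommitWriter implements SimpleWriter {
    private final SimpleWriter delegate;
    AutoCommitWriter(SimpleWriter delegate) { this.delegate = delegate; }
    public void addDocument(String doc) {
        delegate.addDocument(doc);
        delegate.commit();          // commit every change instantly
    }
    public void commit() { delegate.commit(); }
}

// A toy in-memory writer used to demonstrate the wrapper.
class RecordingWriter implements SimpleWriter {
    final List<String> log = new ArrayList<>();
    public void addDocument(String doc) { log.add("add:" + doc); }
    public void commit() { log.add("commit"); }
}
```

Since the wrapper only delegates, the primary writer class stays free of any
commit-scheduling policy.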
    Again: NRT is not a "specialized reader". It's a normal read-only
    DirectoryReader, just like you'd get from IndexReader.open, with the
    only difference being that it consulted IW to find which segments to
    open. Plus, it's pooled, so that if IW already has a given segment
    reader open (say because deletes were applied or merges are running),
    it's reused.
    Well, it seems to me that those two features make it special -- particularly
    the pooling of SegmentReaders. You can't take advantage of that outside the
    context of IndexWriter:
    Yes, Lucene's approach must be in the same JVM. But we get important
    gains from this - reusing a single reader (the pool), carrying over
    merged deletions directly in RAM (and eventually field cache & norms
    too - LUCENE-1785).
    Exactly. In my view, that's what makes that reader "special": unlike ordinary
    Lucene IndexReaders, this one springs into being with its caches already
    primed rather than in need of lazy loading.

    But to achieve those benefits, you have to mod the index writing process.
    Those modifications are not necessary under the Lucy model, because the mere
    act of writing the index stores our data in the system IO cache.
    Instead, Lucy (by design) must do all sharing & access all index data
    through the filesystem (a decision, I think, could be dangerous),
    which will necessarily increase your reopen time.
    Dangerous in what sense?

    Going through the file system is a tradeoff, sure -- but it's pretty nice to
    design your low-latency search app free from any concern about whether
    indexing and search need to be coordinated within a single process.
    Furthermore, if separate processes are your primary concurrency model, going
    through the file system is actually mandatory to achieve best performance on a
    multi-core box. Lucy won't always be used with multi-threaded hosts.

    I actually think going through the file system is dangerous in a different
    sense: it puts pressure on the file format spec. The easy way to achieve IPC
    between writers and readers will be to dump stuff into one of the JSON files
    to support the killer-feature-du-jour -- such as what I'm proposing with this
    "fsync" key in the snapshot file. But then we wind up with a bunch of crap
    cluttering up our index metadata files. I'm determined that Lucy will have a
    more coherent file format than Lucene, but with this IPC requirement we're
    setting our community up to push us in the wrong direction. If we're not
    careful, we could end up with a file format that's an unmaintainable jumble.

    But you're talking performance, not complexity costs, right?
    Maybe in practice that cost is small though... the OS write cache should
    keep everything fresh... but you still must serialize.
    Anecdotally, at Eventful one of our indexes is 5 GB with 16 million records
    and 900 MB worth of sort cache data; opening a fresh searcher and loading all
    sort caches takes circa 21 ms.

    There's room to improve that further -- we haven't yet implemented
    IndexReader.reopen() -- but that was fast enough to achieve what we wanted to
    achieve.
  • Jason Rutherglen (JIRA) at Dec 18, 2009 at 8:29 pm
    [ https://issues.apache.org/jira/browse/LUCENE-2026?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12792629#action_12792629 ]

    Jason Rutherglen commented on LUCENE-2026:
    ------------------------------------------

    {quote}Anecdotally, at Eventful one of our indexes is 5 GB with 16 million records
    and 900 MB worth of sort cache data; opening a fresh searcher and loading all
    sort caches takes circa 21 ms.{quote}

    Marvin, very cool! Are you using the mmap module you mentioned at ApacheCon?
  • Marvin Humphrey (JIRA) at Dec 18, 2009 at 8:52 pm
    [ https://issues.apache.org/jira/browse/LUCENE-2026?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12792638#action_12792638 ]

    Marvin Humphrey commented on LUCENE-2026:
    -----------------------------------------

    Yes, this is using the sort cache model worked out this spring on lucy-dev.
    The memory mapping happens within FSFileHandle (LUCY-83). SortWriter
    and SortReader haven't made it into the Lucy repository yet.
  • Michael McCandless (JIRA) at Dec 19, 2009 at 12:11 am
    [ https://issues.apache.org/jira/browse/LUCENE-2026?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12792713#action_12792713 ]

    Michael McCandless commented on LUCENE-2026:
    --------------------------------------------

    {quote}
    bq. Again: NRT is not a "specialized reader". It's a normal read-only DirectoryReader, just like you'd get from IndexReader.open, with the only difference being that it consulted IW to find which segments to open. Plus, it's pooled, so that if IW already has a given segment reader open (say because deletes were applied or merges are running), it's reused.

    Well, it seems to me that those two features make it special - particularly
    the pooling of SegmentReaders. You can't take advantage of that outside the
    context of IndexWriter:
    {quote}

OK so maybe a little special ;) But, really that pooling should be
    factored out of IW. It's not writer specific.

    {quote}
    bq. Yes, Lucene's approach must be in the same JVM. But we get important gains from this - reusing a single reader (the pool), carrying over merged deletions directly in RAM (and eventually field cache & norms too - LUCENE-1785).

    Exactly. In my view, that's what makes that reader "special": unlike ordinary
    Lucene IndexReaders, this one springs into being with its caches already
    primed rather than in need of lazy loading.

    But to achieve those benefits, you have to mod the index writing process.
    {quote}

    Mod the index writing, and the reader reopen, to use the shared pool.
    The pool in itself isn't writer specific.

    Really the pool is just like what you tap into when you call reopen --
    that method looks at the current "pool" of already opened segments,
    sharing what it can.
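The pooling being described can be sketched minimally like this; the class
names are illustrative stand-ins, not Lucene's actual internals:

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical per-segment reader; in Lucene this would be SegmentReader.
class SegmentReader {
    final String segmentName;
    SegmentReader(String segmentName) { this.segmentName = segmentName; }
}

// A reopen consults the pool of already-open segment readers and only
// truly opens segments it hasn't seen before, sharing the rest.
class ReaderPool {
    private final Map<String, SegmentReader> open = new HashMap<>();
    int opens = 0; // counts real (non-shared) segment opens

    // Return the shared reader if one exists; otherwise open and cache it.
    SegmentReader get(String segmentName) {
        return open.computeIfAbsent(segmentName, name -> {
            opens++;
            return new SegmentReader(name);
        });
    }
}
```

Nothing in this sketch is writer-specific, which is the point: the same pool
could back both IndexWriter's internal readers and reopen().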

    bq. Those modifications are not necessary under the Lucy model, because the mere act of writing the index stores our data in the system IO cache.

    But, that's where Lucy presumably takes a perf hit. Lucene can share
these in RAM, not using the filesystem as the intermediary (eg we do
    that today with deletions; norms/field cache/eventual CSF can do the
    same.) Lucy must go through the filesystem to share.

    {quote}
    bq. Instead, Lucy (by design) must do all sharing & access all index data through the filesystem (a decision, I think, could be dangerous), which will necessarily increase your reopen time.

    Dangerous in what sense?

    Going through the file system is a tradeoff, sure - but it's pretty nice to
    design your low-latency search app free from any concern about whether
    indexing and search need to be coordinated within a single process.
    Furthermore, if separate processes are your primary concurrency model, going
    through the file system is actually mandatory to achieve best performance on a
    multi-core box. Lucy won't always be used with multi-threaded hosts.

    I actually think going through the file system is dangerous in a different
    sense: it puts pressure on the file format spec. The easy way to achieve IPC
    between writers and readers will be to dump stuff into one of the JSON files
    to support the killer-feature-du-jour - such as what I'm proposing with this
    "fsync" key in the snapshot file. But then we wind up with a bunch of crap
    cluttering up our index metadata files. I'm determined that Lucy will have a
    more coherent file format than Lucene, but with this IPC requirement we're
    setting our community up to push us in the wrong direction. If we're not
    careful, we could end up with a file format that's an unmaintainable jumble.

    But you're talking performance, not complexity costs, right?
    {quote}

    Mostly I was thinking performance, ie, trusting the OS to make good
    decisions about what should be RAM resident, when it has limited
    information...

    But, also risky is that all important data structures must be
    "file-flat", though in practice that doesn't seem like an issue so
    far? The RAM resident things Lucene has -- norms, deleted docs, terms
    index, field cache -- seem to "cast" just fine to file-flat. If we
    switched to an FST for the terms index I guess that could get
    tricky...

    Wouldn't shared memory be possible for process-only concurrent models?
    Also, what popular systems/environments have this requirement (only
    process level concurrency) today?

It's wonderful that Lucy can start up really fast, but, for most apps
    that's not nearly as important as searching/indexing performance,
    right? I mean, you start only once, and then you handle many, many
    searches / index many documents, with that process, usually?

    {quote}
    bq. Maybe in practice that cost is small though... the OS write cache should keep everything fresh... but you still must serialize.

    Anecdotally, at Eventful one of our indexes is 5 GB with 16 million records
    and 900 MB worth of sort cache data; opening a fresh searcher and loading all
    sort caches takes circa 21 ms.
    {quote}

    That's fabulously fast!

    But you really need to also test search/indexing throughput, reopen time
    (I think) once that's online for Lucy...

    {quote}
    There's room to improve that further - we haven't yet implemented
    IndexReader.reopen() - but that was fast enough to achieve what we wanted to
    achieve.
    {quote}

    Is reopen even necessary in Lucy?
  • Marvin Humphrey (JIRA) at Dec 20, 2009 at 2:10 am
    [ https://issues.apache.org/jira/browse/LUCENE-2026?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12792939#action_12792939 ]

    Marvin Humphrey commented on LUCENE-2026:
    -----------------------------------------
    But, that's where Lucy presumably takes a perf hit. Lucene can share
these in RAM, not using the filesystem as the intermediary (eg we do
    that today with deletions; norms/field cache/eventual CSF can do the
    same.) Lucy must go through the filesystem to share.
    For a flush(), I don't think there's a significant penalty. The only extra
    costs Lucy will pay are the bookkeeping costs to update the file system state
    and to create the objects that read the index data. Those are real, but since
    we're skipping the fsync(), they're small. As far as the actual data, I don't
    see that there's a difference. Reading from memory mapped RAM isn't any
    slower than reading from malloc'd RAM.

    If we have to fsync(), there'll be a cost, but in Lucene you have to pay that
    same cost, too. Lucene expects to get around it with IndexWriter.getReader().
    In Lucy, we'll get around it by having you call flush() and then reopen a
reader somewhere, often in another process.

    * In both cases, the availability of fresh data is decoupled from the fsync.
    * In both cases, the indexing process has to be careful about dropping data
    on the floor before a commit() succeeds.
    * In both cases, it's possible to protect against unbounded corruption by
    rolling back to the last commit.
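The flush/commit split in those bullets can be illustrated with plain
java.nio; this is a sketch of the visibility-vs-durability distinction, not
Lucy's implementation:

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

class FlushVsCommit {
    // flush(): make new index data visible through the OS cache. Other
    // processes can read it immediately, but it is not yet durable.
    static void flush(FileChannel ch, byte[] data) throws IOException {
        ch.write(ByteBuffer.wrap(data));
    }

    // commit(): pay the fsync cost so the data survives power loss.
    static void commit(FileChannel ch) throws IOException {
        ch.force(true);
    }

    // Self-contained demo: flush three bytes, note they are already
    // visible through the filesystem, then fsync.
    static long demo() {
        try {
            Path p = Files.createTempFile("seg", ".dat");
            try (FileChannel ch = FileChannel.open(p, StandardOpenOption.WRITE)) {
                flush(ch, new byte[]{1, 2, 3});
                long visible = Files.size(p); // readable before any fsync
                commit(ch);
                return visible;
            }
        } catch (IOException e) {
            throw new RuntimeException(e);
        }
    }
}
```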
    Mostly I was thinking performance, ie, trusting the OS to make good
    decisions about what should be RAM resident, when it has limited
    information...
    Right, for instance because we generally can't force the OS to pin term
    dictionaries in RAM, as discussed a while back. It's not an ideal situation,
    but Lucene's approach isn't bulletproof either, since Lucene's term
    dictionaries can get paged out too.

    We're sure not going to throw away all the advantages of mmap and go back to
    reading data structures into process RAM just because of that.
    But, also risky is that all important data structures must be "file-flat",
    though in practice that doesn't seem like an issue so far?
    It's a constraint. For instance, to support mmap, string sort caches
    currently require three "files" each: ords, offsets, and UTF-8 character data.
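A toy model of those three "files", using plain arrays in place of memory
maps; the names and layout are illustrative, not Lucy's on-disk format:

```java
import java.nio.charset.StandardCharsets;

// Illustrative model of a string sort cache backed by three file-flat
// structures: ords (doc -> sorted ordinal), offsets (ordinal -> byte
// offset, with one extra end sentinel), and concatenated UTF-8 data.
class StringSortCache {
    final int[] ords;      // docId -> ordinal in sorted order
    final int[] offsets;   // ordinal -> start offset into utf8
    final byte[] utf8;     // concatenated UTF-8 character data

    StringSortCache(int[] ords, int[] offsets, byte[] utf8) {
        this.ords = ords; this.offsets = offsets; this.utf8 = utf8;
    }

    // Comparing two docs for sorting touches only the ords "file".
    int compareDocs(int docA, int docB) {
        return Integer.compare(ords[docA], ords[docB]);
    }

    // Materializing the string value touches all three "files".
    String value(int docId) {
        int ord = ords[docId];
        return new String(utf8, offsets[ord], offsets[ord + 1] - offsets[ord],
                StandardCharsets.UTF_8);
    }
}
```

Sorting never has to decode the strings at all, which is why keeping the
three structures separate pays off despite the file proliferation.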

    The compound file system makes the file proliferation bearable, though. And
    it's actually nice in a way to have data structures as named files, strongly
    separated from each other and persistent.

    If we were willing to ditch portability, we could cast to arrays of structs in
    Lucy -- but so far we've just used primitives. I'd like to keep it that way,
    since it would be nice if the core Lucy file format was at least theoretically
    compatible with a pure Java implementation. But Lucy plugins could break that
    rule and cast to structs if desired.
    The RAM resident things Lucene has - norms, deleted docs, terms index, field
    cache - seem to "cast" just fine to file-flat.
    There are often benefits to keeping stuff "file-flat", particularly when the
    file-flat form is compressed. If we were to expand those sort caches to
    string objects, they'd take up more RAM than they do now.

    I think the only significant drawback is security: we can't trust memory
    mapped data the way we can data which has been read into process RAM and
    checked on the way in. For instance, we need to perform UTF-8 sanity checking
    each time a string sort cache value escapes the controlled environment of the
    cache reader. If the sort cache value was instead derived from an existing
    string in process RAM, we wouldn't need to check it.
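That per-escape sanity check might look roughly like this in Java (a
hypothetical illustration; Lucy's actual check lives in C):

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Validate bytes as they escape a memory-mapped sort cache: mapped data
// can't be trusted the way process-RAM strings can, so decode strictly.
class Utf8Gate {
    // Decode with strict error reporting instead of silent replacement.
    static String escape(byte[] raw) throws CharacterCodingException {
        return StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT)
                .decode(ByteBuffer.wrap(raw))
                .toString();
    }

    // Convenience predicate: does this byte sequence pass the gate?
    static boolean isValid(byte[] raw) {
        try { escape(raw); return true; }
        catch (CharacterCodingException e) { return false; }
    }
}
```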
    If we switched to an FST for the terms index I guess that could get
    tricky...
    Hmm, I haven't been following that. Too much work to keep up with those
    giganto patches for flex indexing, even though it's a subject I'm intimately
    acquainted with and deeply interested in. I plan to look it over when you're
    done and see if we can simplify it. :)
    Wouldn't shared memory be possible for process-only concurrent models?
    IPC is a platform-compatibility nightmare. By restricting ourselves to
    communicating via the file system, we save ourselves oodles of engineering
    time. And on really boring, frustrating work, to boot.
    Also, what popular systems/environments have this requirement (only process
    level concurrency) today?
    Perl's threads suck. Actually all threads suck. Perl's are just worse than
    average -- and so many Perl binaries are compiled without them. Java threads
    suck less, but they still suck -- look how much engineering time you folks
    blow on managing that stuff. Threads are a terrible programming model.

    I'm not into the idea of forcing Lucy users to use threads. They should be
    able to use processes as their primary concurrency model if they want.
It's wonderful that Lucy can start up really fast, but, for most apps that's
    not nearly as important as searching/indexing performance, right?
    Depends.

    Total indexing throughput in both Lucene and KinoSearch has been pretty decent
    for a long time. However, there's been a large gap between average index
    update performance and worst case index update performance, especially when
    you factor in sort cache loading. There are plenty of applications that may
    not have very high throughput requirements but where it may not be acceptable
    for an index update to take several seconds or several minutes every once in a
    while, even if it usually completes faster.
    I mean, you start only once, and then you handle many, many
    searches / index many documents, with that process, usually?
    Sometimes the person who just performed the action that updated the index is
    the only one you care about. For instance, to use a feature request that came
    in from Slashdot a while back, if someone leaves a comment on your website,
    it's nice to have it available in the search index right away.

    Consistently fast index update responsiveness makes personalization of the
    customer experience easier.
    But you really need to also test search/indexing throughput, reopen time
    (I think) once that's online for Lucy...
    Naturally.
    Is reopen even necessary in Lucy?
    Probably. If you have a boatload of segments and a boatload of fields, you
    might start to see file opening and metadata parsing costs come into play. If
    it turns out that for some indexes reopen() can knock down the time from say,
    100 ms to 10 ms or less, I'd consider that sufficient justification.
  • Michael McCandless (JIRA) at Dec 20, 2009 at 3:42 pm
    [ https://issues.apache.org/jira/browse/LUCENE-2026?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12792996#action_12792996 ]

    Michael McCandless commented on LUCENE-2026:
    --------------------------------------------

    {quote}
bq. But, that's where Lucy presumably takes a perf hit. Lucene can share these in RAM, not using the filesystem as the intermediary (eg we do that today with deletions; norms/field cache/eventual CSF can do the same.) Lucy must go through the filesystem to share.

    For a flush(), I don't think there's a significant penalty. The only extra
    costs Lucy will pay are the bookkeeping costs to update the file system state
    and to create the objects that read the index data. Those are real, but since
    we're skipping the fsync(), they're small. As far as the actual data, I don't
    see that there's a difference.
    {quote}

    But everything must go through the filesystem with Lucy...

    Eg, with Lucene, deletions are not written to disk until you commit.
    Flush doesn't write the del file, merging doesn't, etc. The deletes
    are carried in RAM. We could (but haven't yet -- NRT turnaround time
    is already plenty fast) do the same with norms, field cache, terms
    dict index, etc.

    {quote}
    Reading from memory mapped RAM isn't any slower than reading from malloc'd RAM.

    Right, for instance because we generally can't force the OS to pin term
    dictionaries in RAM, as discussed a while back. It's not an ideal situation,
    but Lucene's approach isn't bulletproof either, since Lucene's term
    dictionaries can get paged out too.
    {quote}

    As long as the page is hot... (in both cases!).

    But by using file-backed RAM (not malloc'd RAM), you're telling the OS
    it's OK if it chooses to swap it out. Sure, malloc'd RAM can be
    swapped out too... but that should be less frequent (and, we can
    control this behavior, somewhat, eg swappiness).

It's similar to using a weak vs. strong reference in Java. By using
    file-backed RAM you tell the OS it's fair game for swapping.
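The file-backed model can be sketched with java.nio's memory mapping; the
term-dictionary framing here is just an example:

```java
import java.io.IOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Index data read through a memory map lives in file-backed RAM: the OS
// may page it out under pressure, much as a JVM may reclaim a weakly
// referenced object. Illustrative only.
class MappedRead {
    static int firstByte(Path file) {
        try (FileChannel ch = FileChannel.open(file, StandardOpenOption.READ)) {
            MappedByteBuffer buf = ch.map(FileChannel.MapMode.READ_ONLY, 0, ch.size());
            return buf.get(0); // a plain memory read; the page faults in on demand
        } catch (IOException e) {
            throw new RuntimeException(e);
        }
    }

    // Self-contained demo: write one byte to a temp file, read it back via mmap.
    static int demo() {
        try {
            Path p = Files.createTempFile("terms", ".dat");
            Files.write(p, new byte[]{42});
            return firstByte(p);
        } catch (IOException e) {
            throw new RuntimeException(e);
        }
    }
}
```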

    {quote}
    If we have to fsync(), there'll be a cost, but in Lucene you have to pay that
    same cost, too. Lucene expects to get around it with IndexWriter.getReader().
    In Lucy, we'll get around it by having you call flush() and then reopen a
reader somewhere, often in another process.

    In both cases, the availability of fresh data is decoupled from the fsync.
    In both cases, the indexing process has to be careful about dropping data
    on the floor before a commit() succeeds.
    In both cases, it's possible to protect against unbounded corruption by
    rolling back to the last commit.
    {quote}

    The two approaches are basically the same, so, we get the same
    features ;)

    It's just that Lucy uses the filesystem for sharing, and Lucene shares
    through RAM.

    bq. We're sure not going to throw away all the advantages of mmap and go back to reading data structures into process RAM just because of that.

    I guess my confusion is what are all the other benefits of using
    file-backed RAM? You can efficiently use process only concurrency
    (though shared memory is technically an option for this too), and you
    have wicked fast open times (but, you still must warm, just like
    Lucene). What else? Oh maybe the ability to inform OS *not* to cache
    eg the reads done when merging segments. That's one I sure wish
    Lucene could use...
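
    The hint Mike wishes Lucene could give can be sketched with Python's mmap
    module (a hedged example, not Lucene code; `mmap.madvise()` exists on
    Python 3.8+ where the platform exposes madvise(2)):

    ```python
    import mmap, os, tempfile

    # After a one-pass, merge-style sequential read of a mapped file, tell
    # the OS the pages need not stay in the cache.
    with tempfile.NamedTemporaryFile(delete=False) as f:
        f.write(b"x" * mmap.PAGESIZE * 4)
        path = f.name

    with open(path, "rb") as f:
        mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)

    data = mm.read()                      # the sequential "merge" pass
    if hasattr(mmap, "MADV_DONTNEED"):    # platform-dependent constant
        mm.madvise(mmap.MADV_DONTNEED)    # cached pages are fair game to drop
    mm.close()
    os.remove(path)
    ```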

    In exchange you risk the OS making poor choices about what gets
    swapped out (LRU policy is too simplistic... not all pages are created
    equal), must downcast all data structures to file-flat, and must share
    everything through the filesystem (a perf hit for NRT).

    I do love how pure the file-backed RAM approach is, but I worry that
    down the road it'll result in erratic search performance in certain
    app profiles.

    {quote}
    bq. But, also risky is that all important data structures must be "file-flat", though in practice that doesn't seem like an issue so far?

    It's a constraint. For instance, to support mmap, string sort caches
    currently require three "files" each: ords, offsets, and UTF-8 character data.
    {quote}

    Yeah, that you need 3 files for the string sort cache is a little
    spooky... that's 3X the chance of a page fault.
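
    The three-"file" layout Marvin describes can be sketched as parallel
    arrays (the names and shapes here are assumptions for illustration, not
    Lucy's actual on-disk format):

    ```python
    # Assumed layout of the three sort-cache "files":
    #   ords[doc]     -> sort position of that doc's value
    #   offsets[ord]  -> byte offset of that value in the UTF-8 blob
    char_data = b"applebananacherry"   # concatenated UTF-8 values, in sort order
    offsets   = [0, 5, 11, 17]         # one extra entry marks the end
    ords      = [2, 0, 1]              # doc 0 holds "cherry", doc 1 "apple", ...

    def sort_value(doc):
        # A full value lookup touches all three structures -- hence the
        # "3X the chance of a page fault" worry.
        o = ords[doc]
        return char_data[offsets[o]:offsets[o + 1]].decode("utf-8")

    def cmp_docs(a, b):
        # Comparing two docs by sort order only needs ords -- no decode.
        return ords[a] - ords[b]
    ```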

    {quote}
    The compound file system makes the file proliferation bearable, though. And
    it's actually nice in a way to have data structures as named files, strongly
    separated from each other and persistent.
    {quote}

    But the CFS construction must also go through the filesystem (like
    Lucene) right? So you still incur IO load of creating the small
    files, then 2nd pass to consolidate.

    I agree there's a certain design purity to having the files clearly
    separate out the elements of the data structures, but if it means
    erratic search performance... function over form?

    {quote}
    If we were willing to ditch portability, we could cast to arrays of structs in
    Lucy - but so far we've just used primitives. I'd like to keep it that way,
    since it would be nice if the core Lucy file format was at least theoretically
    compatible with a pure Java implementation. But Lucy plugins could break that
    rule and cast to structs if desired.
    {quote}

    Someday we could make a Lucene codec that interacts with a Lucy
    index... would be a good exercise to go through to see if the flex API
    really is "flex" enough...

    {quote}
    bq. The RAM resident things Lucene has - norms, deleted docs, terms index, field cache - seem to "cast" just fine to file-flat.

    There are often benefits to keeping stuff "file-flat", particularly when the
    file-flat form is compressed. If we were to expand those sort caches to
    string objects, they'd take up more RAM than they do now.
    {quote}

    We're leaving them as UTF-8 by default for Lucene (with the flex
    changes). Still, the terms index once loaded does have silly RAM
    overhead... we can cut that back a fair amount though.

    {quote}
    I think the only significant drawback is security: we can't trust memory
    mapped data the way we can data which has been read into process RAM and
    checked on the way in. For instance, we need to perform UTF-8 sanity checking
    each time a string sort cache value escapes the controlled environment of the
    cache reader. If the sort cache value was instead derived from an existing
    string in process RAM, we wouldn't need to check it.
    {quote}

    Sigh, that's a curious downside... so term-decode-intensive uses
    (merging, range queries, I guess maybe term dict lookup) take the
    brunt of that hit?

    {quote}
    bq. If we switched to an FST for the terms index I guess that could get tricky...

    Hmm, I haven't been following that.
    {quote}

    There's not much to follow -- it's all just talk at this point. I
    don't think anyone's built a prototype yet ;)

    {quote}
    Too much work to keep up with those giganto patches for flex indexing,
    even though it's a subject I'm intimately acquainted with and deeply
    interested in. I plan to look it over when you're done and see if we
    can simplify it.
    {quote}

    And then we'll borrow back your simplifications ;) Lather, rinse,
    repeat.

    {quote}
    bq. Wouldn't shared memory be possible for process-only concurrent models?

    IPC is a platform-compatibility nightmare. By restricting ourselves to
    communicating via the file system, we save ourselves oodles of
    engineering time. And on really boring, frustrating work, to boot.
    {quote}

    I had assumed so too, but I was surprised that Python's
    multiprocessing module exposes a simple API for sharing objects from
    parent to forked child. It's at least a counter example (though, in
    all fairness, I haven't looked at the impl ;) ), ie, there seems to be
    some hope of containing shared memory under a consistent API.

    I'm just pointing out that "going through the filesystem" isn't the
    only way to have efficient process-only concurrency. Shared memory
    is another option, but, yes it has tradeoffs.
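
    The shared-memory option mentioned above has a concrete, portable face in
    Python 3.8+'s `multiprocessing.shared_memory` (a minimal sketch; a second
    handle attached by name stands in for what a forked worker process would
    do with the same segment):

    ```python
    from multiprocessing import shared_memory

    # Create a named shared-memory segment and write into it.
    seg = shared_memory.SharedMemory(create=True, size=64)
    seg.buf[:5] = b"hello"

    # Another process would attach by name; we simulate that with a second
    # handle in the same process.
    worker_view = shared_memory.SharedMemory(name=seg.name)
    greeting = bytes(worker_view.buf[:5])

    worker_view.close()
    seg.close()
    seg.unlink()
    ```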

    {quote}

    bq. Also, what popular systems/environments have this requirement (only process level concurrency) today?

    Perl's threads suck. Actually all threads suck. Perl's are just worse than
    average - and so many Perl binaries are compiled without them. Java threads
    suck less, but they still suck - look how much engineering time you folks
    blow on managing that stuff. Threads are a terrible programming model.

    I'm not into the idea of forcing Lucy users to use threads. They should be
    able to use processes as their primary concurrency model if they want.
    {quote}

    Yes, working with threads is a nightmare (eg have a look at Java's
    memory model). I think the jury is still out (for our species) just
    how, long term, we'll make use of concurrency with the machines. I
    think we may need to largely take "time" out of our programming
    languages, eg switch to much more declarative code, or
    something... wanna port Lucy to Erlang?

    But I'm not sure process only concurrency, sharing only via
    file-backed memory, is the answer either ;)

    {quote}
    bq. It's wonderful that Lucy can startup really fast, but, for most apps that's not nearly as important as searching/indexing performance, right?

    Depends.

    Total indexing throughput in both Lucene and KinoSearch has been pretty decent
    for a long time. However, there's been a large gap between average index
    update performance and worst case index update performance, especially when
    you factor in sort cache loading. There are plenty of applications that may
    not have very high throughput requirements but where it may not be acceptable
    for an index update to take several seconds or several minutes every once in a
    while, even if it usually completes faster.

    bq. I mean, you start only once, and then you handle many, many searches / index many documents, with that process, usually?

    Sometimes the person who just performed the action that updated the index is
    the only one you care about. For instance, to use a feature request that came
    in from Slashdot a while back, if someone leaves a comment on your website,
    it's nice to have it available in the search index right away.

    Consistently fast index update responsiveness makes personalization of the
    customer experience easier.
    {quote}

    Turnaround time for Lucene NRT is already very fast, as is. After an
    immense merge, it'll be the worst, but if you warm the reader first,
    that won't be an issue.

    Using Zoie you can make reopen time insanely fast (much faster than I
    think necessary for most apps), but at the expense of some expected
    hit to searching/indexing throughput. I don't think that's the right
    tradeoff for Lucene.

    I suspect Lucy is making a similar tradeoff, ie, that search
    performance will be erratic due to page faults, at a smallish gain in
    reopen time.

    Do you have any hard numbers on how much time it takes Lucene to load
    from a hot IO cache, populating its RAM resident data structures? I
    wonder in practice what extra cost we are really talking about... it's
    RAM to RAM "translation" of data structures (if the files are hot).
    FieldCache we just have to fix to stop doing uninversion... (ie we
    need CSF).

    {quote}
    bq. Is reopen even necessary in Lucy?

    Probably. If you have a boatload of segments and a boatload of fields, you
    might start to see file opening and metadata parsing costs come into play. If
    it turns out that for some indexes reopen() can knock down the time from say,
    100 ms to 10 ms or less, I'd consider that sufficient justification.

    {quote}

    OK. Then, you are basically pooling your readers ;) Ie, you do allow
    in-process sharing, but only among readers.
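
    The "pooling" Mike describes can be sketched like so (class and method
    names are illustrative, not Lucene or Lucy API): reopen() reuses the
    per-segment readers that survive and opens new ones only for new segments.

    ```python
    class SegReader:
        def __init__(self, name):
            self.name = name            # a real reader would open files here

    class PolyReader:
        def __init__(self, seg_readers):
            self.seg_readers = seg_readers          # segment name -> SegReader

        def reopen(self, current_segments):
            # Pool: keep readers for surviving segments, open only the new ones.
            pooled = {name: self.seg_readers.get(name) or SegReader(name)
                      for name in current_segments}
            return PolyReader(pooled)

    r1 = PolyReader({"seg_11": SegReader("seg_11")})
    r2 = r1.reopen(["seg_11", "seg_12"])    # seg_12 was just flushed
    ```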

  • Marvin Humphrey (JIRA) at Dec 22, 2009 at 12:27 am
    [ https://issues.apache.org/jira/browse/LUCENE-2026?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12793431#action_12793431 ]

    Marvin Humphrey commented on LUCENE-2026:
    -----------------------------------------
    {quote}
    I guess my confusion is what are all the other benefits of using
    file-backed RAM? You can efficiently use process only concurrency
    (though shared memory is technically an option for this too), and you
    have wicked fast open times (but, you still must warm, just like
    Lucene).
    {quote}
    Processes are Lucy's primary concurrency model. ("The OS is our JVM.")
    Making process-only concurrency efficient isn't optional -- it's a *core*
    *concern*.
    {quote}
    What else? Oh maybe the ability to inform OS not to cache
    eg the reads done when merging segments. That's one I sure wish
    Lucene could use...
    {quote}
    Lightweight searchers mean architectural freedom.

    Create 2, 10, 100, 1000 Searchers without a second thought -- as many as you
    need for whatever app architecture you just dreamed up -- then destroy them
    just as effortlessly. Add another worker thread to your search server without
    having to consider the RAM requirements of a heavy searcher object. Create a
    command-line app to search a documentation index without worrying about
    daemonizing it. Etc.

    If your normal development pattern is a single monolithic Java process, then
    that freedom might not mean much to you. But with their low per-object RAM
    requirements and fast opens, lightweight searchers are easy to use within a
    lot of other development patterns. For example: lightweight searchers work
    well for maxing out multiple CPU cores under process-only concurrency.
    {quote}
    In exchange you risk the OS making poor choices about what gets
    swapped out (LRU policy is too simplistic... not all pages are created
    equal),
    {quote}
    The Linux virtual memory system, at least, is not a pure LRU. It utilizes a
    page aging algo which prioritizes pages that have historically been accessed
    frequently even when they have not been accessed recently:

    {panel}
    http://sunsite.nus.edu.sg/LDP/LDP/tlk/node40.html

    The default action when a page is first allocated, is to give it an
    initial age of 3. Each time it is touched (by the memory management
    subsystem) it's age is increased by 3 to a maximum of 20. Each time the
    Kernel swap daemon runs it ages pages, decrementing their age by 1.
    {panel}

    And while that system may not be ideal from our standpoint, it's still pretty
    good. In general, the operating system's virtual memory scheme is going to
    work fine as designed, for us and everyone else, and minimize memory
    availability wait times.
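
    The quoted aging scheme is easy to model: new pages start at age 3, gain 3
    per touch (capped at 20), and lose 1 each swap-daemon pass, with age 0
    marking an eviction candidate. A toy simulation (numbers taken straight
    from the quoted text) shows why historically hot pages survive idle spells:

    ```python
    AGE_NEW, AGE_TOUCH, AGE_MAX = 3, 3, 20

    def final_age(touches, daemon_passes):
        # Apply the touches, then the daemon passes, per the quoted scheme.
        age = AGE_NEW
        for _ in range(touches):
            age = min(AGE_MAX, age + AGE_TOUCH)
        for _ in range(daemon_passes):
            age = max(0, age - 1)
        return age

    # A historically hot page survives many idle daemon passes; a page never
    # touched again becomes an eviction candidate after just three.
    hot  = final_age(touches=10, daemon_passes=15)
    cold = final_age(touches=0,  daemon_passes=3)
    ```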

    When will swapping out the term dictionary be a problem?

    * For indexes where queries are made frequently, no problem.
    * For systems with plenty of RAM, no problem.
    * For systems that aren't very busy, no problem.
    * For small indexes, no problem.

    The only situation we're talking about is infrequent queries against large
    indexes on busy boxes where RAM isn't abundant. Under those circumstances, it
    *might* be noticeable that Lucy's term dictionary gets paged out somewhat
    sooner than Lucene's.

    But in general, if the term dictionary gets paged out, so what? Nobody was
    using it. Maybe nobody will make another query against that index until next
    week. Maybe the OS made the right decision.

    OK, so there's a vulnerable bubble where the query rate against a large
    index is neither too fast nor too slow, on busy machines where RAM isn't
    abundant. I don't think that bubble ought to drive major architectural
    decisions.

    Let me turn your question on its head. What does Lucene gain in return for
    the slow index opens and large process memory footprint of its heavy
    searchers?
    {quote}
    I do love how pure the file-backed RAM approach is, but I worry that
    down the road it'll result in erratic search performance in certain
    app profiles.
    {quote}
    If necessary, there's a straightforward remedy: slurp the relevant files into
    RAM at object construction rather than mmap them. The rest of the code won't
    know the difference between malloc'd RAM and mmap'd RAM. The slurped files
    won't take up any more space than the analogous Lucene data structures; more
    likely, they'll take up less.

    That's the kind of setting we'd hide away in the IndexManager class rather
    than expose as prominent API, and it would be a hint to index components
    rather than an edict.
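
    The slurp-versus-mmap hint might look like this (a hedged sketch;
    `open_region` and the `slurp` flag are hypothetical names, not Lucy API).
    Either way the caller gets read-only bytes; only the backing store differs:

    ```python
    import mmap, os, tempfile

    def open_region(path, slurp=False):
        # slurp=True: a private malloc'd copy, warmed up front, never paged
        # back to the file. slurp=False: a file-backed, OS-managed mapping.
        with open(path, "rb") as f:
            if slurp:
                return f.read()
            return mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)

    with tempfile.NamedTemporaryFile(delete=False) as f:
        f.write(b"sort cache bytes")
        path = f.name

    slurped = open_region(path, slurp=True)
    mapped = open_region(path)
    same = bytes(mapped) == slurped      # identical bytes either way
    mapped.close()
    os.remove(path)
    ```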
    {quote}
    Yeah, that you need 3 files for the string sort cache is a little
    spooky... that's 3X the chance of a page fault.
    {quote}
    Not when using the compound format.
    {quote}
    But the CFS construction must also go through the filesystem (like
    Lucene) right? So you still incur IO load of creating the small
    files, then 2nd pass to consolidate.
    {quote}

    Yes.
    {quote}
    I think we may need to largely take "time" out of our programming
    languages, eg switch to much more declarative code, or
    something... wanna port Lucy to Erlang?

    But I'm not sure process only concurrency, sharing only via
    file-backed memory, is the answer either
    {quote}
    I think relying heavily on file-backed memory is particularly appropriate for
    Lucy because the write-once file format works well with MAP_SHARED memory
    segments. If files were being modified and had to be protected with
    semaphores, it wouldn't be as sweet a match.

    Focusing on process-only concurrency also works well for Lucy because host
    threading models differ substantially and so will only be accessible via a
    generalized interface from the Lucy C core. It will be difficult to tune
    threading performance through that layer of indirection -- I'm guessing beyond
    the ability of most developers since few will be experts in multiple host
    threading models. In contrast, expertise in process level concurrency will be
    easier to come by and to nourish.
    {quote}
    Using Zoie you can make reopen time insanely fast (much faster than I
    think necessary for most apps), but at the expense of some expected
    hit to searching/indexing throughput. I don't think that's the right
    tradeoff for Lucene.
    {quote}
    But as Jake pointed out early in the thread, Zoie achieves those insanely fast
    reopens without tight coupling to IndexWriter and its components. The
    auxiliary RAM index approach is well proven.
    {quote}
    Do you have any hard numbers on how much time it takes Lucene to load
    from a hot IO cache, populating its RAM resident data structures?
    {quote}
    Hmm, I don't spend a lot of time working with Lucene directly, so I might not
    be the person most likely to have data like that at my fingertips. Maybe that
    McCandless dude can help you out, he runs a lot of benchmarks. ;)

    Or maybe ask the Solr folks? I see them on solr-user all the time talking
    about "MaxWarmingSearchers". ;)
    {quote}
    OK. Then, you are basically pooling your readers ;) Ie, you do allow
    in-process sharing, but only among readers.
    {quote}
    Not sure about that. Lucy's IndexReader.reopen() would open new SegReaders for
    each new segment, but they would be private to each parent PolyReader. So if
    you reopened two IndexReaders at the same time after e.g. segment "seg_12"
    had been added, each would create a new, private SegReader for "seg_12".
  • Michael McCandless (JIRA) at Dec 22, 2009 at 7:11 pm
    [ https://issues.apache.org/jira/browse/LUCENE-2026?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12793737#action_12793737 ]

    Michael McCandless commented on LUCENE-2026:
    --------------------------------------------


    {quote}
    Processes are Lucy's primary concurrency model. ("The OS is our JVM.")
    Making process-only concurrency efficient isn't optional - it's a core
    concern.
    {quote}

    OK

    {quote}
    Lightweight searchers mean architectural freedom.

    Create 2, 10, 100, 1000 Searchers without a second thought - as many as you
    need for whatever app architecture you just dreamed up - then destroy them
    just as effortlessly. Add another worker thread to your search server without
    having to consider the RAM requirements of a heavy searcher object. Create a
    command-line app to search a documentation index without worrying about
    daemonizing it. Etc.
    {quote}

    This is definitely neat.

    {quote}
    The Linux virtual memory system, at least, is not a pure LRU. It utilizes a
    page aging algo which prioritizes pages that have historically been accessed
    frequently even when they have not been accessed recently:

    http://sunsite.nus.edu.sg/LDP/LDP/tlk/node40.html
    {quote}

    Very interesting -- thanks. So it also factors in how much the page
    was used in the past, not just how long it's been since the page was
    last used.

    {quote}
    When will swapping out the term dictionary be a problem?

    For indexes where queries are made frequently, no problem.
    For systems with plenty of RAM, no problem.
    For systems that aren't very busy, no problem.
    For small indexes, no problem.
    The only situation we're talking about is infrequent queries against large
    indexes on busy boxes where RAM isn't abundant. Under those circumstances, it
    might be noticeable that Lucy's term dictionary gets paged out somewhat
    sooner than Lucene's.
    {quote}

    Even smallish indexes can see their pages swapped out? I'd think at
    low-to-moderate search traffic, any index could be at risk, depending
    on whether other processes on the machine are competing for RAM or IO
    cache.

    {quote}
    But in general, if the term dictionary gets paged out, so what? Nobody was
    using it. Maybe nobody will make another query against that index until next
    week. Maybe the OS made the right decision.
    {quote}

    You can't afford many page faults until the latency becomes very
    apparent (until we're all on SSDs... at which point this may all be
    moot).

    Right -- the metric that the swapper optimizes is overall efficient
    use of the machine's resources.

    But I think that's often a poor metric for search apps... I think
    consistency on the search latency is more important, though I agree it
    depends very much on the app.

    I don't like the same behavior in my desktop -- when I switch to my
    mail client, I don't want to wait 10 seconds for it to swap the pages
    back in.

    {quote}
    Let me turn your question on its head. What does Lucene gain in return for
    the slow index opens and large process memory footprint of its heavy
    searchers?
    {quote}

    Consistency in the search time. Assuming the OS doesn't swap our
    pages out...

    And of course Java pretty much forces threads-as-concurrency (JVM
    startup time, hotspot compilation, are costly).

    {quote}
    If necessary, there's a straightforward remedy: slurp the relevant files into
    RAM at object construction rather than mmap them. The rest of the code won't
    know the difference between malloc'd RAM and mmap'd RAM. The slurped files
    won't take up any more space than the analogous Lucene data structures; more
    likely, they'll take up less.

    That's the kind of setting we'd hide away in the IndexManager class rather
    than expose as prominent API, and it would be a hint to index components
    rather than an edict.
    {quote}

    Right, this is how Lucy would force warming.

    {quote}
    bq. Yeah, that you need 3 files for the string sort cache is a little spooky... that's 3X the chance of a page fault.

    Not when using the compound format.
    {quote}

    But, even within that CFS file, these three sub-files will not be
    local? Ie you'll still have to hit three pages per "lookup" right?

    {quote}
    I think relying heavily on file-backed memory is particularly appropriate for
    Lucy because the write-once file format works well with MAP_SHARED memory
    segments. If files were being modified and had to be protected with
    semaphores, it wouldn't be as sweet a match.
    {quote}

    Write-once is good for Lucene too.

    {quote}
    Focusing on process-only concurrency also works well for Lucy because host
    threading models differ substantially and so will only be accessible via a
    generalized interface from the Lucy C core. It will be difficult to tune
    threading performance through that layer of indirection - I'm guessing beyond
    the ability of most developers since few will be experts in multiple host
    threading models. In contrast, expertise in process level concurrency will be
    easier to come by and to nourish.
    {quote}

    I'm confused by this -- eg Python does a great job presenting a simple
    threads interface and implementing it on major OSs. And it seems like
    Lucy would not need anything crazy-os-specific wrt threads?

    {quote}
    bq. Do you have any hard numbers on how much time it takes Lucene to load from a hot IO cache, populating its RAM resident data structures?

    Hmm, I don't spend a lot of time working with Lucene directly, so I might not
    be the person most likely to have data like that at my fingertips. Maybe that
    McCandless dude can help you out, he runs a lot of benchmarks.
    {quote}

    Hmm ;) I'd guess that field cache is slowish; deleted docs & norms are
    very fast; terms index is somewhere in between.

    bq. Or maybe ask the Solr folks? I see them on solr-user all the time talking about "MaxWarmingSearchers".

    Hmm -- not sure what's up with that. Looks like maybe it's the
    auto-warming that might happen after a commit.

    {quote}
    bq. OK. Then, you are basically pooling your readers Ie, you do allow in-process sharing, but only among readers.

    Not sure about that. Lucy's IndexReader.reopen() would open new SegReaders for
    each new segment, but they would be private to each parent PolyReader. So if
    you reopened two IndexReaders at the same time after e.g. segment "seg_12"
    had been added, each would create a new, private SegReader for "seg_12".
    {quote}

    You're right, you'd get two readers for seg_12 in that case. By
    "pool" I meant you're tapping into all the sub-readers that the
    existing reader have opened -- the reader is your pool of sub-readers.

  • Marvin Humphrey (JIRA) at Dec 23, 2009 at 3:56 am
    [ https://issues.apache.org/jira/browse/LUCENE-2026?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12793431#action_12793431 ]

    Marvin Humphrey edited comment on LUCENE-2026 at 12/23/09 3:54 AM:
    -------------------------------------------------------------------
    {quote}
    I guess my confusion is what are all the other benefits of using
    file-backed RAM? You can efficiently use process only concurrency
    (though shared memory is technically an option for this too), and you
    have wicked fast open times (but, you still must warm, just like
    Lucene).
    {quote}
    Processes are Lucy's primary concurrency model. ("The OS is our JVM.")
    Making process-only concurrency efficient isn't optional -- it's a *core*
    *concern*.
    {quote}
    What else? Oh maybe the ability to inform OS not to cache
    eg the reads done when merging segments. That's one I sure wish
    Lucene could use...
    {quote}
    Lightweight searchers mean architectural freedom.

    Create 2, 10, 100, 1000 Searchers without a second thought -- as many as you
    need for whatever app architecture you just dreamed up -- then destroy them
    just as effortlessly. Add another worker thread to your search server without
    having to consider the RAM requirements of a heavy searcher object. Create a
    command-line app to search a documentation index without worrying about
    daemonizing it. Etc.

    If your normal development pattern is a single monolithic Java process, then
    that freedom might not mean much to you. But with their low per-object RAM
    requirements and fast opens, lightweight searchers are easy to use within a
    lot of other development patterns. For example: lightweight searchers work
    well for maxing out multiple CPU cores under process-only concurrency.
    {quote}
    In exchange you risk the OS making poor choices about what gets
    swapped out (LRU policy is too simplistic... not all pages are created
    equal),
    {quote}
    The Linux virtual memory system, at least, is not a pure LRU. It utilizes a
    page aging algo which prioritizes pages that have historically been accessed
    frequently even when they have not been accessed recently:

    {panel}
    http://sunsite.nus.edu.sg/LDP/LDP/tlk/node40.html

    The default action when a page is first allocated, is to give it an
    initial age of 3. Each time it is touched (by the memory management
    subsystem) it's age is increased by 3 to a maximum of 20. Each time the
    Kernel swap daemon runs it ages pages, decrementing their age by 1.
    {panel}

    And while that system may not be ideal from our standpoint, it's still pretty
    good. In general, the operating system's virtual memory scheme is going to
    work fine as designed, for us and everyone else, and minimize memory
    availability wait times.
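For concreteness, the aging scheme quoted above can be sketched in a few lines of C. This is a toy model of the described behavior, not code from Linux or Lucy:

```c
/* Toy model of the page-aging scheme quoted above: a new page starts at
 * age 3, each touch adds 3 up to a cap of 20, and each pass of the
 * kernel swap daemon subtracts 1.  A page at age 0 is a swap candidate. */

enum { INITIAL_AGE = 3, TOUCH_BONUS = 3, MAX_AGE = 20 };

typedef struct { int age; } Page;

Page page_new(void) { Page p = { INITIAL_AGE }; return p; }

void page_touch(Page *p) {
    p->age += TOUCH_BONUS;
    if (p->age > MAX_AGE) p->age = MAX_AGE;
}

void page_age_tick(Page *p) {           /* one swap-daemon pass */
    if (p->age > 0) p->age -= 1;
}

int page_is_swap_candidate(const Page *p) { return p->age == 0; }
```

The upshot: a page touched often in the past (age near 20) survives many daemon passes after its last access, which is why "historically hot" structures like the top of the term dictionary should tend to stick around.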

    When will swapping out the term dictionary be a problem?

    * For indexes where queries are made frequently, no problem.
    * For systems with plenty of RAM, no problem.
    * For systems that aren't very busy, no problem.
    * -For small indexes, no problem.-

    The only situation we're talking about is infrequent queries against -large-
    indexes on busy boxes where RAM isn't abundant. Under those circumstances, it
    *might* be noticeable that Lucy's term dictionary gets paged out somewhat
    sooner than Lucene's.

    But in general, if the term dictionary gets paged out, so what? Nobody was
    using it. Maybe nobody will make another query against that index until next
    week. Maybe the OS made the right decision.

    OK, so there's a vulnerable bubble where the query rate against
    -a large index- an index is neither too fast nor too slow, on busy machines
    where RAM isn't abundant. I don't think that bubble ought to drive major
    architectural decisions.

    Let me turn your question on its head. What does Lucene gain in return for
    the slow index opens and large process memory footprint of its heavy
    searchers?
    I do love how pure the file-backed RAM approach is, but I worry that
    down the road it'll result in erratic search performance in certain
    app profiles.
    If necessary, there's a straightforward remedy: slurp the relevant files into
    RAM at object construction rather than mmap them. The rest of the code won't
    know the difference between malloc'd RAM and mmap'd RAM. The slurped files
    won't take up any more space than the analogous Lucene data structures; more
    likely, they'll take up less.

    That's the kind of setting we'd hide away in the IndexManager class rather
    than expose as prominent API, and it would be a hint to index components
    rather than an edict.
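To sketch that remedy (function and struct names are mine, not Lucy's actual FileHandle API): both paths hand back a plain pointer/length pair, so downstream code genuinely can't tell malloc'd RAM from mmap'd RAM.

```c
#include <fcntl.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/* Hypothetical sketch: either way, consumers see only (ptr, len). */
typedef struct { void *ptr; size_t len; int is_mmap; } FileRegion;

/* mmap the file read-only and shared -- a good fit for write-once data. */
int region_mmap(FileRegion *r, const char *path) {
    struct stat st;
    int fd = open(path, O_RDONLY);
    if (fd < 0 || fstat(fd, &st) < 0) { if (fd >= 0) close(fd); return -1; }
    r->ptr = mmap(NULL, st.st_size, PROT_READ, MAP_SHARED, fd, 0);
    close(fd);
    if (r->ptr == MAP_FAILED) return -1;
    r->len = st.st_size; r->is_mmap = 1;
    return 0;
}

/* The "slurp" remedy: read the whole file into malloc'd RAM instead,
 * nudging the VM system to treat the data as ordinary process memory. */
int region_slurp(FileRegion *r, const char *path) {
    struct stat st;
    size_t total = 0;
    int fd = open(path, O_RDONLY);
    if (fd < 0 || fstat(fd, &st) < 0) { if (fd >= 0) close(fd); return -1; }
    r->ptr = malloc(st.st_size ? st.st_size : 1);
    while (total < (size_t)st.st_size) {
        ssize_t got = read(fd, (char *)r->ptr + total, st.st_size - total);
        if (got <= 0) break;
        total += (size_t)got;
    }
    close(fd);
    if (total != (size_t)st.st_size) { free(r->ptr); return -1; }
    r->len = st.st_size; r->is_mmap = 0;
    return 0;
}

void region_release(FileRegion *r) {
    if (r->is_mmap) munmap(r->ptr, r->len);
    else            free(r->ptr);
}
```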
    Yeah, that you need 3 files for the string sort cache is a little
    spooky... that's 3X the chance of a page fault.
    Not when using the compound format.
    But the CFS construction must also go through the filesystem (like
    Lucene) right? So you still incur IO load of creating the small
    files, then 2nd pass to consolidate. Yes.
    I think we may need to largely take "time" out of our programming
    languages, eg switch to much more declarative code, or
    something... wanna port Lucy to Erlang?

    But I'm not sure process only concurrency, sharing only via
    file-backed memory, is the answer either
    I think relying heavily on file-backed memory is particularly appropriate for
    Lucy because the write-once file format works well with MAP_SHARED memory
    segments. If files were being modified and had to be protected with
    semaphores, it wouldn't be as sweet a match.

    Focusing on process-only concurrency also works well for Lucy because host
    threading models differ substantially and so will only be accessible via a
    generalized interface from the Lucy C core. It will be difficult to tune
    threading performance through that layer of indirection -- I'm guessing beyond
    the ability of most developers since few will be experts in multiple host
    threading models. In contrast, expertise in process level concurrency will be
    easier to come by and to nourish.
    Using Zoie you can make reopen time insanely fast (much faster than I
    think necessary for most apps), but at the expense of some expected
    hit to searching/indexing throughput. I don't think that's the right
    tradeoff for Lucene.
    But as Jake pointed out early in the thread, Zoie achieves those insanely fast
    reopens without tight coupling to IndexWriter and its components. The
    auxiliary RAM index approach is well proven.
    Do you have any hard numbers on how much time it takes Lucene to load
    from a hot IO cache, populating its RAM resident data structures?
    Hmm, I don't spend a lot of time working with Lucene directly, so I might not
    be the person most likely to have data like that at my fingertips. Maybe that
    McCandless dude can help you out, he runs a lot of benchmarks. ;)

    Or maybe ask the Solr folks? I see them on solr-user all the time talking
    about "MaxWarmingSearchers". ;)
    OK. Then, you are basically pooling your readers. Ie, you do allow
    in-process sharing, but only among readers.
    Not sure about that. Lucy's IndexReader.reopen() would open new SegReaders for
    each new segment, but they would be private to each parent PolyReader. So if
    you reopened two IndexReaders at the same time after e.g. segment "seg_12"
    had been added, each would create a new, private SegReader for "seg_12".

    *Edit*: updated to correct assertions about virtual memory performance with
    small indexes.

  • Marvin Humphrey (JIRA) at Dec 23, 2009 at 3:59 am
    [ https://issues.apache.org/jira/browse/LUCENE-2026?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12793918#action_12793918 ]

    Marvin Humphrey commented on LUCENE-2026:
    -----------------------------------------
    Very interesting - thanks. So it also factors in how much the page
    was used in the past, not just how long it's been since the page was
    last used.
    In theory, I think that means the term dictionary will tend to be favored over
    the posting lists. In practice... hard to say, it would be difficult to test.
    :(
    Even smallish indexes can see the pages swapped out?
    Yes, you're right -- the wait time to get at a small term dictionary isn't
    necessarily small. I've amended my previous post, thanks.
    And of course Java pretty much forces threads-as-concurrency (JVM
    startup time, hotspot compilation, are costly).
    Yes. Java does a lot of stuff that most operating systems can also do, but of
    course provides a coherent platform-independent interface. In Lucy we're
    going to try to go back to the OS for some of the stuff that Java likes to
    take over -- provided that we can develop a sane genericized interface using
    configuration probing and #ifdefs.

    It's nice that as long as the box is up our OS-as-JVM is always running, so we
    don't have to worry about its (quite lengthy) startup time.
    Right, this is how Lucy would force warming.
    I think slurp-instead-of-mmap is orthogonal to warming, because we can warm
    file-backed RAM structures by forcing them into the IO cache, using either the
    cat-to-dev-null trick or something more sophisticated. The
    slurp-instead-of-mmap setting would cause warming as a side effect, but the
    main point would be to attempt to persuade the virtual memory system that
    certain data structures should have a higher status and not be paged out as
    quickly.
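A sketch of that warming step (a hypothetical helper, not Lucy's IO layer): the read loop is the cat-to-dev-null trick, and posix_fadvise() is one "more sophisticated" option where available.

```c
#include <fcntl.h>
#include <unistd.h>

/* Force a segment file into the IO cache before the first search. */
int warm_file(const char *path) {
    char buf[1 << 16];
    int fd = open(path, O_RDONLY);
    if (fd < 0) return -1;

#ifdef POSIX_FADV_WILLNEED
    /* Hint that the whole file should be read into the page cache. */
    posix_fadvise(fd, 0, 0, POSIX_FADV_WILLNEED);
#endif

    /* Portable fallback: read every byte and throw it away. */
    while (read(fd, buf, sizeof buf) > 0) { /* discard */ }

    close(fd);
    return 0;
}
```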
    But, even within that CFS file, these three sub-files will not be
    local? Ie you'll still have to hit three pages per "lookup" right?
    They'll be next to each other in the compound file because CompoundFileWriter
    orders them alphabetically. For big segments, though, you're right that they
    won't be right next to each other, and you could possibly incur as many as
    three page faults when retrieving a sort cache value.

    But what are the alternatives for variable width data like strings? You need
    the ords array anyway for efficient comparisons, so what's left are the
    offsets array and the character data.

    An array of String objects isn't going to have better locality than one solid
    block of memory dedicated to offsets and another solid block of memory
    dedicated to file data, and it's no fewer derefs even if the string object
    stores its character data inline -- more if it points to a separate allocation
    (like Lucy's CharBuf does, since it's mutable).

    For each sort cache value lookup, you're going to need to access two blocks of
    memory.

    * With the array of String objects, the first is the memory block dedicated
    to the array, and the second is the memory block dedicated to the String
    object itself, which contains the character data.
    * With the file-backed block sort cache, the first memory block is the
    offsets array, and the second is the character data array.

    I think the locality costs should be approximately the same... have I missed
    anything?
    Write-once is good for Lucene too. Hellyeah.
    And it seems like Lucy would not need anything crazy-os-specific wrt
    threads?
    It depends on how many classes we want to make thread-safe, and it's not just
    the OS, it's the host.

    The bare minimum is simply to make Lucy thread-safe as a library. That's
    pretty close, because Lucy studiously avoided global variables whenever
    possible. The only problems that have to be addressed are the VTable_registry
    Hash, race conditions when creating new subclasses via dynamic VTable
    singletons, and refcounts on the VTable objects themselves.

    Once those issues are taken care of, you'll be able to use Lucy objects in
    separate threads with no problem, e.g. one Searcher per thread.

    However, if you want to *share* Lucy objects (other than VTables) across
    threads, all of a sudden we have to start thinking about "synchronized",
    "volatile", etc. Such constructs may not be efficient or even possible under
    some threading models.
    Hmm I'd guess that field cache is slowish; deleted docs & norms are
    very fast; terms index is somewhere in between.
    That jibes with my own experience. So maybe consider file-backed sort caches
    in Lucene, while keeping the status quo for everything else?
    You're right, you'd get two readers for seg_12 in that case. By
    "pool" I meant you're tapping into all the sub-readers that the
    existing reader have opened - the reader is your pool of sub-readers.
    Each unique SegReader will also have dedicated "sub-reader" objects: two
    "seg_12" SegReaders means two "seg_12" DocReaders, two "seg_12"
    PostingsReaders, etc. However, all those sub-readers will share the same
    file-backed RAM data, so in that sense they're pooled.
  • Michael McCandless (JIRA) at Dec 23, 2009 at 4:28 pm
    [ https://issues.apache.org/jira/browse/LUCENE-2026?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12794095#action_12794095 ]

    Michael McCandless commented on LUCENE-2026:
    --------------------------------------------

    {quote}
    bq. Very interesting - thanks. So it also factors in how much the page was used in the past, not just how long it's been since the page was last used.

    In theory, I think that means the term dictionary will tend to be
    favored over the posting lists. In practice... hard to say, it would
    be difficult to test.
    {quote}

    Right... though, I think the top "trunks" frequently used by the
    binary search, will stay hot. But as you get deeper into the terms
    index, it's not as clear.

    {quote}
    bq. And of course Java pretty much forces threads-as-concurrency (JVM startup time, hotspot compilation, are costly).

    Yes. Java does a lot of stuff that most operating systems can also do, but of
    course provides a coherent platform-independent interface. In Lucy we're
    going to try to go back to the OS for some of the stuff that Java likes to
    take over - provided that we can develop a sane genericized interface using
    configuration probing and #ifdefs.

    It's nice that as long as the box is up our OS-as-JVM is always running, so we
    don't have to worry about its (quite lengthy) startup time.
    {quote}

    OS as JVM is a nice analogy. Java of course gets in the way, too,
    like we cannot properly set IO priorities, we can't give hints to the
    OS to tell it not to cache certain reads/writes (ie segment merging),
    can't pin pages ;), etc.

    {quote}
    bq. Right, this is how Lucy would force warming.

    I think slurp-instead-of-mmap is orthogonal to warming, because we can warm
    file-backed RAM structures by forcing them into the IO cache, using either the
    cat-to-dev-null trick or something more sophisticated. The
    slurp-instead-of-mmap setting would cause warming as a side effect, but the
    main point would be to attempt to persuade the virtual memory system that
    certain data structures should have a higher status and not be paged out as
    quickly.
    {quote}

    Woops, sorry, I misread -- now I understand. You can easily make
    certain files ram resident, and then be like Lucene (except the data
    structures are more compact). Nice.

    {quote}
    bq. But, even within that CFS file, these three sub-files will not be local? Ie you'll still have to hit three pages per "lookup" right?

    They'll be next to each other in the compound file because CompoundFileWriter
    orders them alphabetically. For big segments, though, you're right that they
    won't be right next to each other, and you could possibly incur as many as
    three page faults when retrieving a sort cache value.

    But what are the alternatives for variable width data like strings? You need
    the ords array anyway for efficient comparisons, so what's left are the
    offsets array and the character data.

    An array of String objects isn't going to have better locality than one solid
    block of memory dedicated to offsets and another solid block of memory
    dedicated to file data, and it's no fewer derefs even if the string object
    stores its character data inline - more if it points to a separate allocation
    (like Lucy's CharBuf does, since it's mutable).

    For each sort cache value lookup, you're going to need to access two blocks of
    memory.

    With the array of String objects, the first is the memory block dedicated
    to the array, and the second is the memory block dedicated to the String
    object itself, which contains the character data.
    With the file-backed block sort cache, the first memory block is the
    offsets array, and the second is the character data array.
    I think the locality costs should be approximately the same... have I missed
    anything?
    {quote}

    You're right, Lucene risks 3 (ord array, String array, String object)
    page faults on each lookup as well.

    Actually why can't ord & offset be one, for the string sort cache?
    Ie, if you write your string data in sort order, then the offsets are
    also in sort order? (I think we may have discussed this already?)

    {quote}
    bq. And it seems like Lucy would not need anything crazy-os-specific wrt threads?

    It depends on how many classes we want to make thread-safe, and it's not just
    the OS, it's the host.

    The bare minimum is simply to make Lucy thread-safe as a library. That's
    pretty close, because Lucy studiously avoided global variables whenever
    possible. The only problems that have to be addressed are the VTable_registry
    Hash, race conditions when creating new subclasses via dynamic VTable
    singletons, and refcounts on the VTable objects themselves.

    Once those issues are taken care of, you'll be able to use Lucy objects in
    separate threads with no problem, e.g. one Searcher per thread.

    However, if you want to share Lucy objects (other than VTables) across
    threads, all of a sudden we have to start thinking about "synchronized",
    "volatile", etc. Such constructs may not be efficient or even possible under
    some threading models.
    {quote}

    OK it is indeed hairy. You don't want to have to create Lucy's
    equivalent of the JMM...

    {quote}
    bq. Hmm I'd guess that field cache is slowish; deleted docs & norms are very fast; terms index is somewhere in between.

    That jibes with my own experience. So maybe consider file-backed sort caches
    in Lucene, while keeping the status quo for everything else?
    {quote}

    Perhaps, but it'd still make me nervous ;) When we get
    CSF (LUCENE-1231) online we should make it
    pluggable enough so that one could create an mmap impl.

    {quote}
    bq. You're right, you'd get two readers for seg_12 in that case. By "pool" I meant you're tapping into all the sub-readers that the existing reader have opened - the reader is your pool of sub-readers.

    Each unique SegReader will also have dedicated "sub-reader" objects: two
    "seg_12" SegReaders means two "seg_12" DocReaders, two "seg_12"
    PostingsReaders, etc. However, all those sub-readers will share the same
    file-backed RAM data, so in that sense they're pooled.
    {quote}

    OK

  • Marvin Humphrey (JIRA) at Dec 23, 2009 at 6:27 pm
    [ https://issues.apache.org/jira/browse/LUCENE-2026?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12794137#action_12794137 ]

    Marvin Humphrey commented on LUCENE-2026:
    -----------------------------------------
    we can't give hints to the OS to tell it not to cache certain reads/writes
    (ie segment merging),
    For what it's worth, we haven't really solved that problem in Lucy either.
    The sliding window abstraction we wrapped around mmap/MapViewOfFile largely
    solved the problem of running out of address space on 32-bit operating
    systems. However, there's currently no way to invoke madvise through Lucy's
    IO abstraction layer -- it's a little tricky with compound files.

    Linux, at least, requires that the buffer supplied to madvise be page-aligned.
    So, say we're starting off on a posting list, and we want to communicate to
    the OS that it should treat the region we're about to read as MADV_SEQUENTIAL.
    If the start of the postings file is in the middle of a 4k page and the file
    right before it is a term dictionary, we don't want to indicate that that
    region should be treated as sequential.

    I'm not sure how to solve that problem without violating the encapsulation of
    the compound file model. Hmm, maybe we could store metadata about the virtual
    files indicating usage patterns (sequential, random, etc.)? Since files are
    generally part of dedicated data structures whose usage patterns are known at
    index time.

    Or maybe we just punt on that use case and worry only about segment merging.
    Hmm, wouldn't the act of deleting a file (and releasing all file descriptors) tell
    the OS that it's free to recycle any memory pages associated with it?
bq. Actually why can't ord & offset be one, for the string sort cache? Ie, if you write your string data in sort order, then the offsets are also in sort order? (I think we may have discussed this already?)
    Right, we discussed this on lucy-dev last spring:

    http://markmail.org/message/epc56okapbgit5lw

    Incidentally, some of this thread replays our exchange at the top of
    LUCENE-1458 from a year ago. It was fun to go back and reread that: in the
interim, we've implemented segment-centric search and memory mapped field
    caches and term dictionaries, both of which were first discussed back then.
    :)

    Ords are great for low cardinality fields of all kinds, but become less
    efficient for high cardinality primitive numeric fields. For simplicity's
    sake, the prototype implementation of mmap'd field caches in KS always uses
    ords.
bq. You don't want to have to create Lucy's equivalent of the JMM...
    The more I think about making Lucy classes thread safe, the harder it seems.
    :( I'd like to make it possible to share a Schema across threads, for
    instance, but that means all its Analyzers, etc have to be thread-safe as
    well, which isn't practical when you start getting into contributed
    subclasses.

    Even if we succeed in getting Folders and FileHandles thread safe, it will be
    hard for the user to keep track of what they can and can't do across threads.
    "Don't share anything" is a lot easier to understand.

    We reap a big benefit by making Lucy's metaclass infrastructure thread-safe.
    Beyond that, seems like there's a lot of pain for little gain.
    Refactoring of IndexWriter
    --------------------------

    Key: LUCENE-2026
    URL: https://issues.apache.org/jira/browse/LUCENE-2026
    Project: Lucene - Java
    Issue Type: Improvement
    Components: Index
    Reporter: Michael Busch
    Assignee: Michael Busch
    Priority: Minor
    Fix For: 3.1


I've been thinking for a while about refactoring the IndexWriter into
two main components.
One could be called a SegmentWriter; as the
name says, its job would be to write one particular index segment. The
default one, just as today, would provide methods to add documents and
would flush when its buffer is full.
Other SegmentWriter implementations would do things like appending or
copying external segments [what addIndexes*() currently does].
The second component's job would be to manage writing the segments
file and merging/deleting segments. It would know about
DeletionPolicy, MergePolicy and MergeScheduler. Ideally it would
provide hooks that allow users to manage external data structures and
keep them in sync with Lucene's data during segment merges.
API-wise there are things we have to figure out, such as where the
updateDocument() method would fit in, because its deletion part
affects all segments, whereas the new document is only being added to
the new segment.
Of course these should be lower-level APIs for things like parallel
indexing and related use cases. That's why we should still provide
easy-to-use APIs like today for people who don't need to care about
per-segment ops during indexing. So the current IndexWriter could
probably keep most of its APIs and delegate to the new classes.
    --
    This message is automatically generated by JIRA.
    -
    You can reply to this email to add a comment to the issue online.


  • Michael McCandless (JIRA) at Dec 23, 2009 at 7:09 pm
    [ https://issues.apache.org/jira/browse/LUCENE-2026?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12794161#action_12794161 ]

    Michael McCandless commented on LUCENE-2026:
    --------------------------------------------

    {quote}
    For what it's worth, we haven't really solved that problem in Lucy either.
    The sliding window abstraction we wrapped around mmap/MapViewOfFile largely
    solved the problem of running out of address space on 32-bit operating
    systems. However, there's currently no way to invoke madvise through Lucy's
    IO abstraction layer - it's a little tricky with compound files.

    Linux, at least, requires that the buffer supplied to madvise be page-aligned.
    So, say we're starting off on a posting list, and we want to communicate to
    the OS that it should treat the region we're about to read as MADV_SEQUENTIAL.
    If the start of the postings file is in the middle of a 4k page and the file
    right before it is a term dictionary, we don't want to indicate that that
    region should be treated as sequential.

    I'm not sure how to solve that problem without violating the encapsulation of
    the compound file model. Hmm, maybe we could store metadata about the virtual
    files indicating usage patterns (sequential, random, etc.)? Since files are
    generally part of dedicated data structures whose usage patterns are known at
    index time.

    Or maybe we just punt on that use case and worry only about segment merging.
    {quote}

    Storing metadata seems OK. It'd be optional for codecs to declare that...

    {quote}
    Hmm, wouldn't the act of deleting a file (and releasing all file descriptors) tell
    the OS that it's free to recycle any memory pages associated with it?
    {quote}

    It better!

    {quote}
    bq. Actually why can't ord & offset be one, for the string sort cache? Ie, if you write your string data in sort order, then the offsets are also in sort order? (I think we may have discussed this already?)

    Right, we discussed this on lucy-dev last spring:

    http://markmail.org/message/epc56okapbgit5lw
    {quote}

    OK I'll go try to catch up... but I'm about to drop [sort of]
offline for a week and a half! There's a lot of reading there! Should
    be a prereq that we first go back and re-read what we said "the last
    time"... ;)

    {quote}
    Incidentally, some of this thread replays our exchange at the top of
    LUCENE-1458 from a year ago. It was fun to go back and reread that: in the
interim, we've implemented segment-centric search and memory mapped field
    caches and term dictionaries, both of which were first discussed back then.
    {quote}

    Nice!

    {quote}
    Ords are great for low cardinality fields of all kinds, but become less
    efficient for high cardinality primitive numeric fields. For simplicity's
    sake, the prototype implementation of mmap'd field caches in KS always uses
    ords.
    {quote}

    Right...

    {quote}
    bq. You don't want to have to create Lucy's equivalent of the JMM...

    The more I think about making Lucy classes thread safe, the harder it seems.
    I'd like to make it possible to share a Schema across threads, for
    instance, but that means all its Analyzers, etc have to be thread-safe as
    well, which isn't practical when you start getting into contributed
    subclasses.

    Even if we succeed in getting Folders and FileHandles thread safe, it will be
    hard for the user to keep track of what they can and can't do across threads.
    "Don't share anything" is a lot easier to understand.

    We reap a big benefit by making Lucy's metaclass infrastructure thread-safe.
    Beyond that, seems like there's a lot of pain for little gain.
    {quote}

    Yeah. Threads are not easy :(


Discussion Overview
group: dev @ lucene.apache.org
category: lucene
posted: Nov 3, '09 at 11:17p
active: Dec 23, '09 at 7:09p
posts: 34