Hi guys,

For the purpose of our product we've devised a bunch of small tool
classes which handle various utility tasks like:
1. IndexRecoverer - assuming the "segments" file is missing or
corrupted, this tool rebuilds it based on the *.cfs (and other) files
found in the index dir (excludes files listed in deletable)

2. IndexSplitter - splits an existing index into 2, 3 or more roughly
equal-sized indices. It simply distributes the segment files into distinct
directories and then uses the IndexRecoverer to rebuild each new index's
"segments" file.

3. IndexMerger - the reverse of IndexSplitter: merges several indices into
a single index. Uses a modified version of IndexWriter.addIndexes that
does not optimize() at the beginning and at the end. This way the
resulting index is not a single huge cfs file, which is desirable in
some cases.

4. IndexOptimizer - optimizes an existing index by merging the 'small'
segments and compacting the large segments (compacting means 'removing
the deleted docs within them'). Also converts any old-style "spilled"
segments to compound file format.
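
As an illustration of the directory scan described in item 1, here is a
minimal sketch in plain Java. It is not the attached tool and does not use
any Lucene API: the class SegmentNameScan and the method candidateSegments
are hypothetical names, and treating every file named "_name.ext" as a
per-segment file is a simplifying assumption. It only collects the segment
names that a rebuilt "segments" file would have to list; writing that file
is left out.

```java
import java.util.Collection;
import java.util.Set;
import java.util.SortedSet;
import java.util.TreeSet;

// Hypothetical helper, not part of Lucene.
final class SegmentNameScan {

    /**
     * Returns the distinct segment names found among fileNames,
     * skipping every file that is listed in deletable.
     */
    static SortedSet<String> candidateSegments(Collection<String> fileNames,
                                               Set<String> deletable) {
        SortedSet<String> names = new TreeSet<>();
        for (String file : fileNames) {
            if (deletable.contains(file)) {
                continue; // scheduled for deletion, so not part of the index
            }
            int dot = file.indexOf('.');
            // Assumption: per-segment files look like "_a.cfs" or "_a.fnm".
            // "segments" and "deletable" have no leading underscore.
            if (file.startsWith("_") && dot > 1) {
                names.add(file.substring(0, dot));
            }
        }
        return names;
    }
}
```

For a directory holding _a.cfs, _b.fnm, _b.frq and _c.cfs, with _c.cfs
listed in deletable, this yields the names _a and _b.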

All of the above-mentioned tools are classes within the
org.apache.lucene.index package, as they use some package-scope methods
and properties (and they feel like they belong there).

Now the design change suggestion. It concerns the 'deletable'-related code.
According to the source comments, the delayed deletion of files
through 'deletable' is required on Windows only, as this OS prevents
files that are open for reading from being deleted.
Working on the IndexOptimizer tool, I found myself in a situation where I
needed to 'safe delete' a bunch of obsolete segments while having only
an (FS)Directory and a segment file name, but the 'safe delete' feature
is in IndexWriter. After reviewing the code I came to the
conclusion that the 'safe delete' feature logically belongs to the
(FS)Directory class, not to IndexWriter. I was able to move the
corresponding code from IndexWriter to (FS)Directory. IMO this way is better.
I am attaching the modified sources (based on 2.0.0) of IndexWriter and
(FS)Directory for your consideration. (Disclaimer: I can't guarantee my
changes are bug-free.)

Best regards,
Stanislav


  • Michael McCandless at Dec 6, 2006 at 6:10 pm
    Hello,

    These sound very interesting!

    I think some of them would go under contrib (as utility tools?) and
    others maybe into the core. I've added more detailed comments below.

    Stanislav Jordanov wrote:
    > Hi guys,
    >
    > For the purpose of our product we've devised a bunch of small tool
    > classes which handle various utility tasks like:
    > 1. IndexRecoverer - assuming the "segments" file is missing or
    > corrupted, this tool rebuilds it based on the *.cfs (and other) files
    > found in the index dir (excludes files listed in deletable)

    Excellent. I know that various cases of "recovering an index" have
    come up on the lists over time. It would be great to have a single
    tool that can try to correct the different problems that users hit, e.g.
    removing a single unusable segments file, regenerating the segments
    file, etc.

    > 2. IndexSplitter - splits an existing index in 2, 3 or more relatively
    > equally sized indices. It simply splits the segments files in distinct
    > directories and then uses the IndexRecoverer to rebuild each new Index's
    > segment file

    Seems like a good tool for contrib?

    > 3. IndexMerger - in reverse to IndexSplitter merges some indices into
    > single index; Uses a modified version of IndexWriter.addIndexes - it
    > does not optimize() in the beginning and in the end. This way the
    > resulting index is not a single huge cfs file, which is desirable in
    > some cases.

    You should have a look at the current Lucene trunk: a new method
    (called addIndexesNoOptimize) has been added that I think addresses
    this same need.

    > 4. IndexOptimizer - Optimizes existing index by merging the 'small'
    > segments and compacting the large segments (compacting means 'removing
    > the deleted docs within them'); Also converts to compound file format
    > any old-style "spilled" segments.

    Ooh -- this sounds like a lighter weight version of the current
    "optimize"? Compacting single segments would be particularly useful
    for very large indices that receive many updates to each doc. It
    seems like this could be a new method on IndexWriter?

    Though I think this could break the index segment invariants (see the
    new merge policy in IndexWriter in the trunk) when there are many
    deletes against the large older segments (I think a fairly typical use
    case actually).

    > All of the above mentioned tools are classes within the
    > org.apache.lucene.index package as they use some package-scope methods
    > and properties (+ they feel like belonging there).
    >
    > Now the design change suggestion - it is about the 'deletable' related
    > code;
    > according to the source comments - the delayed deletion of files
    > through the 'deletable' is required on Windows only as this OS prevents
    > files opened for reading to be deleted.
    > Working on the IndexOptimizer tool I found myself in a situation where I
    > needed to 'safe delete' a bunch of obsolete segments while having only
    > an (FS)Directory and a segment file name. And the 'safe delete' feature
    > is in IndexWriter. Then after reviewing the code I came to the
    > conclusion that the 'safe delete' feature logically belongs to the
    > (FS)Directory class, not to IndexWriter. I was able to move the
    > corresponding code from IndexWriter to (FS)Directory. IMO this way is
    > better.

    You should also look at the trunk for this one. The deletion logic
    has moved into a separate class (IndexFileDeleter) which handles
    figuring out which files 1) look to be Lucene index files, but 2) are
    not in fact referenced by the current segments file, and then
    safely deletes them (with retries).
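
    The two-step logic described above (find files that look like index
    files but are not referenced, then delete them with retries) can be
    sketched in plain Java as follows. This is not the IndexFileDeleter
    source from the trunk: the class UnreferencedFileSweeper, its method
    names, and the "_name.ext" test for "looks like an index file" are
    assumptions made for illustration. The actual delete is passed in as a
    callback, so a delete that fails (for example because the file is still
    open on Windows) simply stays pending until the next attempt.

    ```java
    import java.util.ArrayList;
    import java.util.Collection;
    import java.util.Collections;
    import java.util.LinkedHashSet;
    import java.util.List;
    import java.util.Set;
    import java.util.function.Predicate;

    // Hypothetical helper, not the Lucene IndexFileDeleter class.
    final class UnreferencedFileSweeper {

        // Files whose deletion failed so far and must be retried later.
        private final Set<String> pending = new LinkedHashSet<>();

        /** Files that look like index files but are not referenced. */
        static List<String> findUnreferenced(Collection<String> allFiles,
                                             Set<String> referenced) {
            List<String> result = new ArrayList<>();
            for (String file : allFiles) {
                // Assumption: index files are named like "_a.cfs".
                boolean looksLikeIndexFile =
                        file.startsWith("_") && file.indexOf('.') > 1;
                if (looksLikeIndexFile && !referenced.contains(file)) {
                    result.add(file);
                }
            }
            return result;
        }

        /**
         * Tries to delete the given files plus everything still pending.
         * tryDelete returns true when the file is gone; failures remain
         * in the pending set and are retried on the next call.
         */
        void deleteAll(Collection<String> files, Predicate<String> tryDelete) {
            pending.addAll(files);
            pending.removeIf(tryDelete);
        }

        Set<String> pending() {
            return Collections.unmodifiableSet(pending);
        }
    }
    ```

    Calling deleteAll again with an empty collection retries only the
    files that could not be deleted earlier.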

    Mike

    ---------------------------------------------------------------------
    To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
    For additional commands, e-mail: java-dev-help@lucene.apache.org
  • Doug Cutting at Dec 6, 2006 at 7:21 pm

    Michael McCandless wrote:
    >> 1. IndexRecoverer - assuming the "segments" file is missing or
    >> corrupted, this tool rebuilds it based on the *.cfs (and other) files
    >> found in the index dir (excludes files listed in deletable)
    > Excellent. I know that various cases of "recovering an index" have
    > come up on the lists over time. It would be great to have a single
    > tool that can try to correct the different problems that users hit, e.g.
    > removing a single unusable segments file, regenerating the segments
    > file, etc.
    >> 2. IndexSplitter - splits an existing index in 2, 3 or more relatively
    >> equally sized indices. It simply splits the segments files in distinct
    >> directories and then uses the IndexRecoverer to rebuild each new
    >> Index's segment file
    > Seems like a good tool for contrib?

    If these rely only on the public index-format spec and public index
    APIs, then they could go in contrib, which would be easiest, since
    expectations about back-compatibility and long-term support are lower
    for contrib.

    But if they rely on index package internals then they should be
    maintained with the core. Then the question becomes: are these features
    that we can maintain long-term? The index implementation will likely
    evolve, and the existing public API should be supported through this
    evolution: APIs must be more durable than implementations. So, are
    these features things that can be supported through likely
    implementation changes?

    I suspect they are. We've talked about making the postings format more
    flexible, but I have not heard anyone talk about a need to substantially
    alter the segments & merging model. Are we comfortable adding public
    APIs that depend on that model?

    An index splitter is useful with parallel and/or distributed search.
    Splitting on segment boundaries is fairly limited, but perhaps, with
    clever use of IndexWriter.setMaxMergeDocs(), it is sufficient.
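
    A minimal sketch of what splitting on segment boundaries amounts to,
    in plain Java: whole segments are assigned, largest first, to
    whichever target index currently holds the fewest documents. The class
    SegmentPartitioner, the method partition, and the use of per-segment
    document counts as the size measure are assumptions for illustration,
    not part of Lucene or of the proposed IndexSplitter. The result can
    only be as balanced as the largest segment allows, which is where
    capping segment size with setMaxMergeDocs() would help.

    ```java
    import java.util.ArrayList;
    import java.util.List;
    import java.util.Map;

    // Hypothetical helper, not part of Lucene.
    final class SegmentPartitioner {

        /**
         * Splits segments into the given number of groups without ever
         * cutting a segment. docCounts maps segment name to doc count.
         */
        static List<List<String>> partition(Map<String, Integer> docCounts,
                                            int parts) {
            if (parts < 1) {
                throw new IllegalArgumentException("parts must be >= 1");
            }
            // Largest segments first; ties broken by name for determinism.
            List<Map.Entry<String, Integer>> segments =
                    new ArrayList<>(docCounts.entrySet());
            segments.sort((a, b) -> {
                int bySize = Integer.compare(b.getValue(), a.getValue());
                return bySize != 0 ? bySize : a.getKey().compareTo(b.getKey());
            });

            List<List<String>> groups = new ArrayList<>();
            long[] totals = new long[parts];
            for (int i = 0; i < parts; i++) {
                groups.add(new ArrayList<>());
            }
            for (Map.Entry<String, Integer> segment : segments) {
                // Pick the group with the fewest documents so far.
                int lightest = 0;
                for (int i = 1; i < parts; i++) {
                    if (totals[i] < totals[lightest]) {
                        lightest = i;
                    }
                }
                groups.get(lightest).add(segment.getKey());
                totals[lightest] += segment.getValue();
            }
            return groups;
        }
    }
    ```

    With segments _a, _b, _c and _d holding 100, 60, 50 and 10 documents
    and a two-way split, this produces the groups (_a, _d) and (_b, _c),
    110 documents each.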

    Doug


Discussion Overview
group: java-dev
categories: lucene
posted: Dec 6, '06 at 4:36p
active: Dec 6, '06 at 7:21p
posts: 3
users: 3
website: lucene.apache.org
