Grokbase Groups Lucene dev May 2010
An Anti-Merging Multi-Directory Indexing Framework
--------------------------------------------------

Key: LUCENE-2425
URL: https://issues.apache.org/jira/browse/LUCENE-2425
Project: Lucene - Java
Issue Type: New Feature
Components: contrib/*, Index
Affects Versions: 3.0.1
Reporter: Karthick Sankarachary


By design, a Lucene index tends to merge documents that span multiple segments into fewer segments, in order to optimize its directory structure, which in turn leads to better search performance. In particular, it relies on a merge policy to specify the set of merge operations that should be performed when the index is optimized.

Oftentimes, there's a need to do the exact opposite, which is to "split" the documents. This calls for a mechanism that facilitates sub-division of documents based on a certain (ideally, user-defined) algorithm. By way of example, one may wish to sub-divide (or partition) documents based on parameters such as time, space, real-timeliness, and so on. Herein, we describe an indexing framework that builds on the Lucene index writer and reader to address use cases wherein documents need to diverge rather than converge.

In brief, it associates zero or more sub-directories with the index's directory, which serve to complement it in some manner. The sub-directories (a.k.a. splits) are managed by a split policy, which is notified of all changes made to the index directory (a.k.a. super-directory), thus allowing it to modify its sub-directories as it sees fit. To make the index reader and writer "observable", we extend Lucene's reader and writer with the goal of providing hooks into every method that could potentially change the index. This allows for propagation of such changes to the split policy, which essentially acts as a listener on the index.

We refer to each sub-directory (or split) and the super-directory as a sub-index of the containing index (a.k.a. the split index). Note that the sub-directory may not necessarily be co-located with the super-directory. Furthermore, the split policy in turn relies on one or more split rules to determine when to add or remove sub-directories. This allows for a clear separation of the event that triggers a split from the management of those splits.
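
As a rough illustration of the listener arrangement described above, here is a minimal sketch. It is not taken from LUCENE-2425.patch: every name in it (SplitRule, SketchSplitPolicy, ObservableWriterSketch) is hypothetical, and plain in-memory lists stand in for Lucene directories. It assumes the simplest possible rule, "start a new split once the newest one holds N documents".

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical split rule: decides when the newest split is full. */
interface SplitRule {
    boolean shouldSplit(int docsInNewestSplit);
}

/** Hypothetical split policy: listens to the writer and manages the splits. */
class SketchSplitPolicy {
    private final List<SplitRule> rules;
    // Each inner list stands in for one sub-directory (split).
    private final List<List<String>> splits = new ArrayList<>();

    SketchSplitPolicy(List<SplitRule> rules) {
        this.rules = rules;
        splits.add(new ArrayList<>()); // start with a single, empty split
    }

    /** Hook called by the writer after a document reaches the super-directory. */
    void onDocumentAdded(String doc) {
        List<String> newest = splits.get(splits.size() - 1);
        for (SplitRule rule : rules) {
            if (rule.shouldSplit(newest.size())) {
                // A rule fired: open a new split and stop consulting rules.
                newest = new ArrayList<>();
                splits.add(newest);
                break;
            }
        }
        newest.add(doc);
    }

    int splitCount() { return splits.size(); }

    List<String> split(int i) { return splits.get(i); }
}

/** Stand-in for an observable index writer: every change is forwarded. */
class ObservableWriterSketch {
    private final List<String> superDirectory = new ArrayList<>();
    private final SketchSplitPolicy policy;

    ObservableWriterSketch(SketchSplitPolicy policy) { this.policy = policy; }

    void addDocument(String doc) {
        superDirectory.add(doc);     // the change itself
        policy.onDocumentAdded(doc); // the notification to the listener
    }

    int superDirectorySize() { return superDirectory.size(); }
}
```

The rule only answers "is a split due?", while the policy decides what to do about it, which is the separation of trigger from management that the description refers to.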

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org


  • Karthick Sankarachary (JIRA) at May 1, 2010 at 9:56 pm
    [ https://issues.apache.org/jira/browse/LUCENE-2425?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

    Karthick Sankarachary updated LUCENE-2425:
    ------------------------------------------

    Attachment: LUCENE-2425.patch
  • Karthick Sankarachary (JIRA) at May 1, 2010 at 9:58 pm
    [ https://issues.apache.org/jira/browse/LUCENE-2425?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12863062#action_12863062 ]

    Karthick Sankarachary commented on LUCENE-2425:
    -----------------------------------------------

    In this comment, we outline all split policies that have been or are in the process of being implemented. Hopefully, this will serve to not only validate the framework, but also be a reference point for future work.

    The split policies currently under development include:

    1) A rotating split policy, which is essentially a time-bound index, where each sub-index denotes a (contiguous) time range, and there's a cap on the number of sub-indices.
    2) An archiving split policy, which builds on the rotating split policy, where older sub-indexes (that have been rotated out) are kept around for a while before being removed.
    3) A real-time split policy, which overcomes the near-real-time limitation of current indices. It does so by essentially maintaining a cache for each reader obtained for that index.
    4) A caching split policy, which builds on the real-time split policy, where writes (and other updates) to the index are buffered in-memory until it is told to flush.
    5) A mirroring split policy, which treats each sub-directory as a mirror image of the super-directory.
    6) A sharding split policy, which treats each sub-directory as a shard (or slice) of the super-directory.
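
To make the first policy in this list concrete, the following is a small sketch of how a rotating, time-bound set of sub-indices with a cap might behave. It is an illustration only, not the code attached to the issue: the class name, the fixed range width, and the use of in-memory lists instead of Lucene directories are all assumptions, and it further assumes that timestamps arrive in non-decreasing order.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

/** Hypothetical rotating set of time-bound sub-indices with a cap. */
class RotatingSplitsSketch {
    /** One sub-index covering the range [rangeStart, rangeStart + width). */
    static final class SubIndex {
        final long rangeStart;
        final List<String> docs = new ArrayList<>();
        SubIndex(long rangeStart) { this.rangeStart = rangeStart; }
    }

    private final long width;        // length of each time range
    private final int maxSubIndices; // cap on the number of sub-indices
    private final Deque<SubIndex> subIndices = new ArrayDeque<>();

    RotatingSplitsSketch(long width, int maxSubIndices) {
        if (width <= 0 || maxSubIndices < 1) {
            throw new IllegalArgumentException("width and cap must be positive");
        }
        this.width = width;
        this.maxSubIndices = maxSubIndices;
    }

    /** Adds a document; late documents simply land in the newest sub-index. */
    void add(long timestamp, String doc) {
        long rangeStart = Math.floorDiv(timestamp, width) * width;
        SubIndex newest = subIndices.peekLast();
        if (newest == null || newest.rangeStart < rangeStart) {
            // The document falls into a later range: rotate in a new sub-index.
            newest = new SubIndex(rangeStart);
            subIndices.addLast(newest);
            if (subIndices.size() > maxSubIndices) {
                subIndices.removeFirst(); // rotate out the oldest range
            }
        }
        newest.docs.add(doc);
    }

    int size() { return subIndices.size(); }

    long oldestRangeStart() { return subIndices.getFirst().rangeStart; }

    int newestDocCount() { return subIndices.getLast().docs.size(); }
}
```

An archiving variant along the lines of the second policy would presumably move the rotated-out sub-index to a holding area for some time instead of dropping it immediately.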

  • Karthick Sankarachary (JIRA) at May 2, 2010 at 3:33 pm
    [ https://issues.apache.org/jira/browse/LUCENE-2425?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

    Karthick Sankarachary updated LUCENE-2425:
    ------------------------------------------

    Attachment: (was: LUCENE-2425.patch)
  • Karthick Sankarachary (JIRA) at May 2, 2010 at 3:35 pm
    [ https://issues.apache.org/jira/browse/LUCENE-2425?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

    Karthick Sankarachary updated LUCENE-2425:
    ------------------------------------------

    Attachment: LUCENE-2425.patch

    Added test cases for the framework.
  • Karthick Sankarachary (JIRA) at May 2, 2010 at 7:58 pm
    [ https://issues.apache.org/jira/browse/LUCENE-2425?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12863062#action_12863062 ]

    Karthick Sankarachary edited comment on LUCENE-2425 at 5/2/10 3:56 PM:
    -----------------------------------------------------------------------

    In this comment, we outline all split policies that have been or are in the process of being implemented. Hopefully, this will serve to not only validate the framework, but also be a reference point for future work.

    The split policies available so far include:

    1) LUCENE-2429: A rotating split policy, which is essentially a time-bound index, where each sub-index denotes a (contiguous) time range, and there's a cap on the number of sub-indices.
    2) LUCENE-2430: An archiving split policy, which builds on the rotating split policy, where older sub-indexes (that have been rotated out) are kept around for a while before being removed.
    3) LUCENE-2431: A real-time split policy, which overcomes the near-real-time limitation of current indices. It does so by essentially maintaining a cache for each reader obtained for that index.
    4) LUCENE-2432: A caching split policy, which builds on the real-time split policy, where writes (and other updates) to the index are buffered in-memory until it is told to flush.
    5) LUCENE-2433: A remoting split policy, which is an abstraction where each sub-directory maps to a (remote) URI.
    6) LUCENE-2434: A mirroring split policy, which treats each sub-directory as a mirror image of the super-directory.
    7) LUCENE-2435: A sharding split policy, which treats each sub-directory as a shard (or slice) of the super-directory.
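
The difference between the mirroring and sharding policies at the end of this list can be sketched as follows. This is an assumption-laden illustration, not the implementation in LUCENE-2434 or LUCENE-2435: the class and method names are hypothetical, lists stand in for sub-directories, and hashing the document id is just one plausible way to pick a shard.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical fan-out of documents over a fixed set of sub-directories. */
class SplitFanOutSketch {
    // Each inner list stands in for one sub-directory.
    private final List<List<String>> subDirs = new ArrayList<>();

    SplitFanOutSketch(int subDirCount) {
        if (subDirCount < 1) {
            throw new IllegalArgumentException("need at least one sub-directory");
        }
        for (int i = 0; i < subDirCount; i++) {
            subDirs.add(new ArrayList<>());
        }
    }

    /** Mirroring: every sub-directory receives a copy of the document. */
    void mirror(String docId) {
        for (List<String> dir : subDirs) {
            dir.add(docId);
        }
    }

    /** Sharding: exactly one sub-directory receives the document. */
    void shard(String docId) {
        subDirs.get(shardFor(docId)).add(docId);
    }

    /** Assumed routing: non-negative hash of the id modulo the shard count. */
    int shardFor(String docId) {
        return Math.floorMod(docId.hashCode(), subDirs.size());
    }

    int count(int subDir) { return subDirs.get(subDir).size(); }

    int total() {
        int sum = 0;
        for (List<String> dir : subDirs) {
            sum += dir.size();
        }
        return sum;
    }
}
```

With mirroring, the total number of stored copies grows with the number of sub-directories; with sharding, it equals the number of documents, and the same id always routes to the same shard.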

    The split policies under development include:


  • Michael McCandless (JIRA) at May 3, 2010 at 10:08 am
    [ https://issues.apache.org/jira/browse/LUCENE-2425?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12863294#action_12863294 ]

    Michael McCandless commented on LUCENE-2425:
    --------------------------------------------
    From a distance this looks very interesting!
    It looks roughly similar to what ParallelReader (and the ParallelWriter proposed and being iterated on in LUCENE-1879) is trying to accomplish, except that they split a single document into different slices by field, whereas this issue sends different documents to different slices.

    It looks like you split "under" the Directory abstraction? How do you handle the doc store (term vectors, stored fields) files, which IW normally writes as it indexes to single open IndexOutputs?
  • Karthick Sankarachary (JIRA) at May 4, 2010 at 4:29 am
    [ https://issues.apache.org/jira/browse/LUCENE-2425?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12863653#action_12863653 ]

    Karthick Sankarachary commented on LUCENE-2425:
    -----------------------------------------------

    Hi Michael,

    To answer your first question, yes, I do see some similarities between this issue and LUCENE-1879. However, it appears that the latter serves only as a mirroring mechanism, whereas in this feature mirroring is but one of its many applications (see LUCENE-2433). That said, the caching split policy described in LUCENE-2433 does reuse the ParallelReader for reading the mirrors (or splits) it maintains. The big differences that I see are as follows:

    a) The split writer treats its (sub-)directories as black boxes, whereas the parallel writer appears to regard them as white boxes.
    b) The parallel writer appears to require consumers to be aware of whether a sub-directory is a master or slave. The split writer, on the other hand, insulates the consumer from the implementation details of the mirroring mechanism, by providing them with a single, logical view into the mirrored index.
    c) The parallel writer proposes to use a two-phase mechanism for ensuring consistency of add/delete operations on the index. The mirroring split policy does not (yet) take care to ensure that the changes operate as a "unit of work", i.e., in an all-or-nothing fashion. For example, when you commit the split writer, it currently attempts to commit each of the writers for its sub-directories, but without addressing the failure scenario. To me, that is an oversight that can easily be remedied.
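
For point c), one way the missing "unit of work" behaviour could be approached is a two-phase commit across the sub-directory writers. The sketch below is only an assumption about how that might look and is not part of the patch: SubWriter and RecordingWriter are hypothetical stand-ins whose method names are borrowed from IndexWriter's prepareCommit(), commit() and rollback(), and failures are modelled as unchecked exceptions to keep the example short.

```java
import java.util.List;

/** Hypothetical view of a sub-directory writer; names mirror IndexWriter's. */
interface SubWriter {
    void prepareCommit();
    void commit();
    void rollback();
}

/** In-memory stand-in used only to demonstrate the protocol. */
class RecordingWriter implements SubWriter {
    private final boolean failOnPrepare;
    String state = "open";

    RecordingWriter(boolean failOnPrepare) { this.failOnPrepare = failOnPrepare; }

    @Override public void prepareCommit() {
        if (failOnPrepare) {
            throw new IllegalStateException("simulated prepare failure");
        }
        state = "prepared";
    }

    @Override public void commit() { state = "committed"; }

    @Override public void rollback() { state = "rolled back"; }
}

class UnitOfWorkCommitter {
    /** Returns true if every writer committed, false if all were rolled back. */
    static boolean commitAll(List<? extends SubWriter> writers) {
        try {
            // Phase 1: ask every writer to do the work that is likely to fail.
            for (SubWriter w : writers) {
                w.prepareCommit();
            }
        } catch (RuntimeException e) {
            // Any failure in phase 1 discards the pending changes everywhere.
            for (SubWriter w : writers) {
                w.rollback();
            }
            return false;
        }
        // Phase 2: all writers are prepared, so finish the commit on each.
        for (SubWriter w : writers) {
            w.commit();
        }
        return true;
    }
}
```

Note that this only narrows the window for inconsistency: a failure during the second phase could still leave some sub-directories committed and others not, so a real implementation would need a recovery story for that case.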

    The best way to understand the capabilities of the split policies outlined above is to take a look at their test cases. At the risk of sounding cliche, the proof is in the pudding.

    To answer your second question, a split does not necessarily need to be physically under the directory abstraction. For example, in the case of LUCENE-2431, LUCENE-2432, LUCENE-2433, LUCENE-2434 and LUCENE-2435, the splits are either RAM-based directories or URI-based directories, both of which reside outside of the "master" directory (to use the terminology of LUCENE-1879).

    Note that I don't go out of my way to ensure the consistency of the "postings files (merge choices, flush, deletions files, segments files, turning off the stores, etc.)" across the splits in the mirrored split writer. Instead, I assume that as long as the mirrors are configured and updated in the same way, then the doc store files in each mirror will eventually be consistent.

    Regards,
    Karthick
  • Otis Gospodnetic (JIRA) at Jun 2, 2010 at 6:19 am
    [ https://issues.apache.org/jira/browse/LUCENE-2425?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12874445#action_12874445 ]

    Otis Gospodnetic commented on LUCENE-2425:
    ------------------------------------------

    Karthick, it looks like your May 1st comment ended with "The split policies under development include:", but without the actual list of those policies.

