[ https://issues.apache.org/jira/browse/LUCENE-1313?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jason Rutherglen updated LUCENE-1313:
-------------------------------------

Component/s: (was: contrib/*)
Index
Fix Version/s: 2.9
Priority: Minor (was: Major)
Description:
Realtime search with transactional semantics.

Possible future directions:
* Optimistic concurrency
* Replication

Encoding each transaction into a set of bytes by writing to a RAMDirectory enables replication. It is difficult to replicate using other methods because while the document may easily be serialized, the analyzer cannot.

I think this issue can hold realtime benchmarks which include indexing and searching concurrently.

was:
Provides realtime search using Lucene. Conceptually, updates are divided into discrete transactions. Each transaction is recorded to a transaction log, which is similar to the MySQL binary log. Deletes from the transaction are applied to the existing indexes. Document additions are made to an in-memory InstantiatedIndex. The transaction is then complete. After each transaction, TransactionSystem.getSearcher() may be called to search over the index, including the latest transaction.

TransactionSystem is the main class. Methods similar to IndexWriter are provided for updating. getSearcher returns a Searcher class.

- getSearcher()
- addDocument(Document document)
- addDocument(Document document, Analyzer analyzer)
- updateDocument(Term term, Document document)
- updateDocument(Term term, Document document, Analyzer analyzer)
- deleteDocument(Term term)
- deleteDocument(Query query)
- commitTransaction(List<Document> documents, Analyzer analyzer, List<Term> deleteByTerms, List<Query> deleteByQueries)

Sample code:

{code}
// setup
FSDirectoryMap directoryMap = new FSDirectoryMap(new File("/testocean"), "log");
LogDirectory logDirectory = directoryMap.getLogDirectory();
TransactionLog transactionLog = new TransactionLog(logDirectory);
TransactionSystem system = new TransactionSystem(transactionLog, new SimpleAnalyzer(), directoryMap);

// transaction
Document d = new Document();
d.add(new Field("contents", "hello world", Field.Store.YES, Field.Index.TOKENIZED));
system.addDocument(d);

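// search: the searcher includes the transaction applied above
Query query = new TermQuery(new Term("contents", "hello")); // example query, an assumption; any Query works here
OceanSearcher searcher = system.getSearcher();
ScoreDoc[] hits = searcher.search(query, null, 1000).scoreDocs;
System.out.println(hits.length + " total results");
for (int i = 0; i < hits.length && i < 10; i++) {
  Document hit = searcher.doc(hits[i].doc);
  System.out.println(i + " " + hits[i].score + " " + hit.get("contents"));
}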
{code}

There is a test class org.apache.lucene.ocean.TestSearch that was used for basic testing.

A sample disk directory structure is as follows:
/snapshot_105_00.xml | XML file containing which indexes and their generation numbers correspond to a snapshot. Each transaction creates a new snapshot file. In this file the 105 is the snapshotid, also known as the transactionid. The 00 is the minor version of the snapshot corresponding to a merge. A merge is a minor snapshot version because the data does not change, only the underlying structure of the index|
/3 | Directory containing an on disk Lucene index|
/log | Directory containing log files|
/log/log00000001.bin | Log file. As new log files are created the suffix number is incremented|
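
The snapshot file naming described above (snapshot id, then minor version) can be illustrated with a small parser. This is only a sketch: the class name SnapshotFileName and the regular expression are assumptions made for illustration and are not taken from the patch.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper: splits "snapshot_105_00.xml" into its id and minor version. */
final class SnapshotFileName {
    // Assumed layout: snapshot_<snapshotId>_<minorVersion>.xml
    private static final Pattern NAME = Pattern.compile("snapshot_(\\d+)_(\\d+)\\.xml");

    final long snapshotId;   // also the transaction id
    final int minorVersion;  // incremented by a merge; the data itself is unchanged

    private SnapshotFileName(long snapshotId, int minorVersion) {
        this.snapshotId = snapshotId;
        this.minorVersion = minorVersion;
    }

    static SnapshotFileName parse(String fileName) {
        Matcher m = NAME.matcher(fileName);
        if (!m.matches()) {
            throw new IllegalArgumentException("Not a snapshot file name: " + fileName);
        }
        return new SnapshotFileName(Long.parseLong(m.group(1)), Integer.parseInt(m.group(2)));
    }
}
```

For example, SnapshotFileName.parse("snapshot_105_00.xml") yields snapshot id 105 and minor version 0.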


Affects Version/s: 2.4.1
Summary: Realtime Search (was: Ocean Realtime Search)
Realtime Search
---------------

Key: LUCENE-1313
URL: https://issues.apache.org/jira/browse/LUCENE-1313
Project: Lucene - Java
Issue Type: New Feature
Components: Index
Affects Versions: 2.4.1
Reporter: Jason Rutherglen
Priority: Minor
Fix For: 2.9

Attachments: LUCENE-1313.patch, lucene-1313.patch, lucene-1313.patch, lucene-1313.patch, lucene-1313.patch


Realtime search with transactional semantics.
Possible future directions:
* Optimistic concurrency
* Replication
Encoding each transaction into a set of bytes by writing to a RAMDirectory enables replication. It is difficult to replicate using other methods because while the document may easily be serialized, the analyzer cannot.
I think this issue can hold realtime benchmarks which include indexing and searching concurrently.
--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

Search Discussions

  • Jason Rutherglen (JIRA) at Apr 1, 2009 at 10:50 pm
    [ https://issues.apache.org/jira/browse/LUCENE-1313?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

    Jason Rutherglen updated LUCENE-1313:
    -------------------------------------

    Attachment: LUCENE-1313.patch

    The patch includes RealtimeIndex, a basic class for performing atomic
    transactional realtime indexing and search. A single thread
    periodically flushes the RAM index to disk. It relies on
    LUCENE-1516.

    We need to benchmark this, specifically 1) realtime w/ramdir
    transaction 2) realtime w/queued documents transaction 3) normal
    indexing. Realtime w/ramdir encodes the transaction to a
    RAMDirectory which is added to the RAM writer using
    IW.addIndexesNoOptimize. Option 1 may be slower than option 2,
    however if the system is replicating it may be the only option?

    Long term I believe we need to implement searching over the
    IndexWriter ram buffer (if possible). However I am not sure how
    option 2 would work with it?
  • Jason Rutherglen (JIRA) at Apr 1, 2009 at 10:50 pm
    [ https://issues.apache.org/jira/browse/LUCENE-1313?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12694810#action_12694810 ]

    Jason Rutherglen edited comment on LUCENE-1313 at 4/1/09 3:50 PM:
    ------------------------------------------------------------------

    The patch includes RealtimeIndex, a basic class for performing atomic
    transactional realtime indexing and search. A single thread
    periodically flushes the RAM index to disk. It relies on
    LUCENE-1516.

    We need to benchmark this, specifically 1) realtime w/ramdir
    transaction 2) realtime w/queued documents transaction 3) normal
    indexing. Realtime w/ramdir encodes the transaction to a
    RAMDirectory which is added to the RAM writer using
    IW.addIndexesNoOptimize. Option 1 may be slower than option 2,
    however if the system is replicating it may be the only option?

    Long term I believe we need to implement searching over the
    IndexWriter ram buffer (if possible). However I am not sure how
    option 1 and replication would work with it?

    was (Author: jasonrutherglen):
    The patch includes RealtimeIndex, a basic class for performing atomic
    transactional realtime indexing and search. A single thread
    periodically flushes the RAM index to disk. It relies on
    LUCENE-1516.

    We need to benchmark this, specifically 1) realtime w/ramdir
    transaction 2) realtime w/queued documents transaction 3) normal
    indexing. Realtime w/ramdir encodes the transaction to a
    RAMDirectory which is added to the RAM writer using
    IW.addIndexesNoOptimize. Option 1 may be slower than option 2,
    however if the system is replicating it may be the only option?

    Long term I believe we need to implement searching over the
    IndexWriter ram buffer (if possible). However I am not sure how
    option 2 would work with it?
  • Michael McCandless (JIRA) at Apr 2, 2009 at 8:54 am
    [ https://issues.apache.org/jira/browse/LUCENE-1313?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12694917#action_12694917 ]

    Michael McCandless commented on LUCENE-1313:
    --------------------------------------------

    Jason, your last patch looks like it's taking the "flush first to RAM Dir" approach I just described as the next step (on the java-dev thread), right? Which is great!

    So this has no external dependencies, right? And it simply layers on top of LUCENE-1516.

    I'd be very interested to compare (benchmark) this approach vs solely LUCENE-1516.

    Could we change this class so that instead of taking a Transaction object, holding adds & deletes, it simply mirrors IndexWriter's API? Ie, I'd like to decouple the performance optimization of "let's flush small segments through a RAMDir first" from the transactional semantics of "I process a transaction atomically, and lock out other threads' transactions". Ie, the transactional restriction could/should layer on top of this performance optimization for near-realtime search?
  • Jason Rutherglen (JIRA) at Apr 6, 2009 at 5:26 pm
    [ https://issues.apache.org/jira/browse/LUCENE-1313?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12696186#action_12696186 ]

    Jason Rutherglen commented on LUCENE-1313:
    ------------------------------------------

    bq. So this has no external dependencies, right?

    Yes.

    {quote}I'd be very interested to compare (benchmark) this approach
    vs solely LUCENE-1516.{quote}

    Is the .alg using the NearRealtimeReader from LUCENE-1516 our
    best measure of realtime performance?

    {quote}
    the transactional restriction could/should layer on
    top of this performance optimization for near-realtime search?
    {quote}

    The transactional system should be able to support both methods.
    Perhaps a non-locking setting would allow the same RealtimeIndex
    class to support both modes of operation?
  • Jason Rutherglen (JIRA) at Apr 6, 2009 at 10:12 pm
    [ https://issues.apache.org/jira/browse/LUCENE-1313?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12696277#action_12696277 ]

    Jason Rutherglen commented on LUCENE-1313:
    ------------------------------------------

    We'll need to integrate the RAM based indexer into IndexWriter
    to carry over the deletes to the ram index while it's copied to
    disk. This is similar to IndexWriter.commitMergedDeletes
    carrying deletes over at the segment reader level based on a
    comparison of the current reader and the cloned reader.
    Otherwise there's redundant deletions to the disk index using
    IW.deleteDocuments which can be unnecessarily expensive. To make
    this external we would need to do the delete by doc id genealogy.
  • Michael McCandless (JIRA) at Apr 7, 2009 at 8:57 am
    [ https://issues.apache.org/jira/browse/LUCENE-1313?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12696438#action_12696438 ]

    Michael McCandless commented on LUCENE-1313:
    --------------------------------------------

    {quote}
    I'd be very interested to compare (benchmark) this approach
    vs solely LUCENE-1516.
    Is the .alg using the NearRealtimeReader from LUCENE-1516 our
    best measure of realtime performance?
    {quote}

    So far, I think so? You get to set an update rate (delete + add docs), eg 50 docs/sec, and a pause time between NRT reopens.

    Still, it's synthetic. If you guys (LinkedIn) have a way to fold in some realism into the test, that'd be great, if only "our app ingests at X docs(MB)/sec and reopens the NRT reader X times per second" to set our ballpark.

    {quote}
    the transactional restriction could/should layer on
    top of this performance optimization for near-realtime search?
    The transactional system should be able to support both methods.
    Perhaps a non-locking setting would allow the same RealtimeIndex
    class to support both modes of operation?
    {quote}

    Sorry, what are both "modes" of operation?

    I think there are two different "layers" here -- first layer optimizes NRT by flushing small segments to RAMDir first. This seems generally useful and in theory has no impact to the API IndexWriter exposes (it's "merely" an internal optimization). The second layer adds this new Transaction object, such that N adds/deletes/commit/re-open NRT reader can be done atomically wrt other pending Transaction objects.
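
The layering described above can be sketched in a few lines. This is a toy model, not Lucene code: ToyWriter and ToyTransactions are hypothetical names, a list of strings stands in for the index, and a plain lock stands in for whatever mechanism the patch would use to keep transactions atomic with respect to each other.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.locks.ReentrantLock;

/** Layer 1 stand-in: exposes a writer-style API (add, delete, reopen) with no transactions. */
final class ToyWriter {
    private final List<String> docs = new ArrayList<>();

    synchronized void addDocument(String doc) { docs.add(doc); }

    synchronized void deleteDocument(String doc) { docs.remove(doc); }

    /** Returns a point-in-time copy, standing in for reopening an NRT reader. */
    synchronized List<String> reopen() { return new ArrayList<>(docs); }
}

/** Layer 2: applies a batch of deletes and adds, then reopens, atomically wrt other transactions. */
final class ToyTransactions {
    private final ToyWriter writer;
    private final ReentrantLock lock = new ReentrantLock();

    ToyTransactions(ToyWriter writer) { this.writer = writer; }

    List<String> commit(List<String> adds, List<String> deletes) {
        lock.lock(); // lock out other transactions for the whole batch
        try {
            for (String d : deletes) writer.deleteDocument(d);
            for (String a : adds) writer.addDocument(a);
            return writer.reopen();
        } finally {
            lock.unlock();
        }
    }
}
```

Callers that do not need atomic batches can use the writer directly, which is the point of keeping the transactional layer separate from the performance optimization.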

    {quote}
    We'll need to integrate the RAM based indexer into IndexWriter
    to carry over the deletes to the ram index while it's copied to
    disk. This is similar to IndexWriter.commitMergedDeletes
    carrying deletes over at the segment reader level based on a
    comparison of the current reader and the cloned reader.
    Otherwise there's redundant deletions to the disk index using
    IW.deleteDocuments which can be unnecessarily expensive. To make
    this external we would need to do the delete by doc id genealogy.
    {quote}

    Right, I think the RAMDir optimization would go directly into IW, if we can separate it out from Transaction. It could naturally derive from the existing RAMBufferSizeMB, ie if NRT forces a flush, so long as it's tiny, put it into the local RAMDir instead of the actual Dir, then "deduct" that size from the allowed budget of DW's ram usage. When RAMDir + DW exceeds RAMBufferSizeMB, we then merge all of RAMDir's segments into a "real" segment in the directory.
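
The budget rule above can be sketched as simple bookkeeping. This is a toy model under stated assumptions, not IndexWriter code: RamFlushBudget and its methods are hypothetical names, sizes are plain byte counts, and the case where the writer buffer alone exceeds the budget (a normal flush to disk) is deliberately not modeled.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical bookkeeping for "small NRT flushes go to a RAMDir until the shared budget is exceeded". */
final class RamFlushBudget {
    private final long budgetBytes;                            // stands in for RAMBufferSizeMB
    private final List<Long> ramSegments = new ArrayList<>();  // sizes of segments held in the RAMDir
    private long ramDirBytes = 0;                              // total bytes held in the RAMDir
    private long writerBufferBytes = 0;                        // bytes buffered by the document writer (DW)
    private int diskMerges = 0;                                // times the RAM segments were merged to disk

    RamFlushBudget(long budgetBytes) { this.budgetBytes = budgetBytes; }

    /** Records newly buffered document bytes. */
    void buffer(long bytes) {
        writerBufferBytes += bytes;
        mergeToDiskIfOverBudget();
    }

    /** An NRT reopen forces a flush: the buffered bytes become a small segment in the RAMDir. */
    void flushForReopen() {
        if (writerBufferBytes == 0) return;
        ramSegments.add(writerBufferBytes);
        ramDirBytes += writerBufferBytes;
        writerBufferBytes = 0;
        mergeToDiskIfOverBudget();
    }

    /** When RAMDir + DW exceeds the budget, all RAM segments become one on-disk segment. */
    private void mergeToDiskIfOverBudget() {
        if (!ramSegments.isEmpty() && ramDirBytes + writerBufferBytes > budgetBytes) {
            ramSegments.clear();
            ramDirBytes = 0;
            diskMerges++;
        }
    }

    int ramSegmentCount() { return ramSegments.size(); }
    long ramDirBytes() { return ramDirBytes; }
    long writerBufferBytes() { return writerBufferBytes; }
    int diskMerges() { return diskMerges; }
}
```

With a budget of 100, two reopens after buffering 40 bytes each leave two RAM segments; buffering 30 more pushes the total past the budget and triggers a single merge to disk.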
  • Jason Rutherglen (JIRA) at Apr 8, 2009 at 12:24 am
    [ https://issues.apache.org/jira/browse/LUCENE-1313?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

    Jason Rutherglen updated LUCENE-1313:
    -------------------------------------

    Attachment: LUCENE-1313.jar

    Latest realtime code; transactions are removed.

    * Needs to be benchmarked

    * There could be concurrency issues around deletes that occur
    while directories are being flushed to disk.

    * It's packaged as a JAR to include the files and directory structure.
    The patch relies on LUCENE-1516, which, if included, would make the
    changes incomprehensible.




  • Jason Rutherglen (JIRA) at Apr 8, 2009 at 9:44 pm
    [ https://issues.apache.org/jira/browse/LUCENE-1313?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12697226#action_12697226 ]

    Jason Rutherglen commented on LUCENE-1313:
    ------------------------------------------

    {quote} Still, it's synthetic. If you guys (LinkedIn) have a way
    to fold in some realism into the test, that'd be great, if only
    "our app ingests at X docs(MB)/sec and reopens the NRT reader X
    times per second" to set our ballpark. {quote}

    The test we need to progress to is running the indexing side
    endlessly while also reopening every X seconds, then
    concurrently running searches. This way we can play with a bunch
    of settings (mergescheduler threads, merge factors, max merge
    docs, etc), use the python code to generate a dozen cases,
    execute them and find out what seems optimal for our corpus.
    It's a bit of work but probably the only way each Lucene user
    can conclusively say they have the optimal settings needed for
    their system. Usually there is a baseline QPS that is desired,
    where the reopen delay may be increased to accommodate a lack of
    QPS.

    The ram dir portion of the NRT indexing increases in speed when
    more threads are allocated but those compete with search
    threads, another issue to keep in mind.

    It might be good to add some default charting to
    contrib/benchmark?
  • Michael McCandless (JIRA) at Apr 9, 2009 at 9:20 am
    [ https://issues.apache.org/jira/browse/LUCENE-1313?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12697444#action_12697444 ]

    Michael McCandless commented on LUCENE-1313:
    --------------------------------------------

    {quote}
    The test we need to progress to is running the indexing side
    endlessly while also reopening every X seconds, then
    concurrently running searches
    {quote}

    Do you have a sense of what we'd need to add to contrib/benchmark to make this test possible? LUCENE-1516 takes the first baby step (adds a "NearRealtimeReaderTask").

    {quote}
    Usually there is a baseline QPS that is desired,
    where the reopen delay may be increased to accommodate a lack of
    QPS.
    {quote}
    Right -- that's the point I made on java-dev about the "freedom" we have wrt NRT's performance.

    {quote}
    The ram dir portion of the NRT indexing increases in speed when
    more threads are allocated but those compete with search
    threads, another issue to keep in mind.
    {quote}
    Well, single threaded indexing speed is also improved by using RAM dir. Ie the use of RAM dir is orthogonal to the app's use of threads for indexing?

    {quote}
    It might be good to add some default charting to
    contrib/benchmark?
    {quote}
    I've switched to Google's visualization API (http://code.google.com/apis/visualization/) which is a delight (they have a simple-to-use Python wrapper). It'd be awesome to somehow get simple charting folded into benchmark... maybe start w/ sheer data export (as tab/comma delimited line file), and then have a separate step that slurps that data in and makes a [Google vis] chart.
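
The data export step mentioned above could be as small as the following sketch. The class name BenchmarkTsv and the three columns (elapsed seconds, docs per second, reopen latency in milliseconds) are assumptions chosen for illustration; contrib/benchmark does not define them.

```java
import java.util.Locale;

/** Hypothetical formatter: one tab-delimited line per benchmark sample. */
final class BenchmarkTsv {
    /** Column names for the exported file (assumed, not from contrib/benchmark). */
    static String header() {
        return "elapsedSec\tdocsPerSec\treopenMs";
    }

    /** Formats one sample; Locale.ROOT keeps the decimal separator a dot on every machine. */
    static String row(double elapsedSec, double docsPerSec, long reopenMs) {
        return String.format(Locale.ROOT, "%.1f\t%.1f\t%d", elapsedSec, docsPerSec, reopenMs);
    }
}
```

A separate charting step can then read the file line by line and split on tabs, which keeps the benchmark itself free of any charting dependency.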
  • Jason Rutherglen (JIRA) at Apr 17, 2009 at 8:28 pm
    [ https://issues.apache.org/jira/browse/LUCENE-1313?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

    Jason Rutherglen updated LUCENE-1313:
    -------------------------------------

    Attachment: LUCENE-1313.patch

    I added an IndexWriter.getRAMIndex method that returns a
    RAMIndex object that can be updated and flushed to the
    underlying writer. I think this is better than adding more
    methods to IndexWriter and it separates out the logic of the RAM
    based near realtime index and the rest of IW.

    Package protected IW.addIndexesNoOptimize(DirectoryIndexReader[]
    readers) is added which is used by RAMIndex.flush. I thought
    this functionality could work for LUCENE-1589 as a public
    method, however because of the way IndexWriter performs merges
    using segment infos, handling generic IndexReader classes (which
    may not use segmentinfos) would then be difficult in the
    addIndexesNoOptimize case.

    I think RAMIndex.flush to the underlying writer is not
    synchronized. If the IW is using ConcurrentMergeScheduler then
    the heavy lifting is performed in the background and so should
    not delay adding more documents to the RAMIndex.

    IW.getReader returns the normal IW reader and the RAMIndex
    reader if there is one.

    The RAMIndex writer can be obtained and modified directly as
    opposed to duplicating the setter methods of IndexWriter such as
    setMergeScheduler.
  • Jason Rutherglen (JIRA) at Apr 20, 2009 at 9:11 pm
    [ https://issues.apache.org/jira/browse/LUCENE-1313?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

    Jason Rutherglen updated LUCENE-1313:
    -------------------------------------

    Attachment: LUCENE-1313.patch

    * The RAMIndex deletes approach changed to be like IndexWriter.
    The deletes are queued in lists, then applied on RI.flush.

    * There is redundancy between IW.delete* and RI.delete*, perhaps
    we don't need RI.delete*?

    * We need more multithreaded tests, probably based on
    TestIndexWriter to see if we can trigger issues in regards to
    deletes that occur while RI is calling IW.addIndexesNoOptimize.

    * If RI.delete* is removed, do we need a separate RAMIndex class
    to add documents to or is there a more transparent way for NRT
    ramdir to work? Perhaps we can add an IW.flushToRamDir (whereas
    IW.flush writes to the IW directory) method that flushes the
    rambuffer to the RAMIndex? Some of the issues are around
    swapping out the RAMDir once its segments are flushed to IW. If
    we took this approach, would we need an IW.getReaderRAM method
    that instead of flushing to disk flushes to the ramdir? The
    other problem with the IW.flushToRamDir system is the loss of
    concurrency where a large rambuffer may be flushing to disk
    while the user really wants small incremental NRT RI based
    updates at the same time.
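
    As a rough illustration of the queue-then-apply delete handling
    described in the first point above, here is a minimal sketch in
    plain Java. BufferedDeletesSketch and its methods are hypothetical
    stand-ins, not code from the patch or from Lucene, and a plain
    string stands in for a delete Term. It also ignores the ordering
    between adds and deletes that a real writer has to respect.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for the RAMIndex discussed above; not Lucene code.
class BufferedDeletesSketch {
    // docId -> value of the single "id" term that identifies the document
    private final Map<Integer, String> liveDocs = new HashMap<>();
    // Delete terms queued since the last flush
    private final List<String> pendingDeleteTerms = new ArrayList<>();
    private int nextDocId = 0;

    int addDocument(String idTerm) {
        liveDocs.put(nextDocId, idTerm);
        return nextDocId++;
    }

    // Only records the delete; nothing is removed until flush().
    void deleteDocuments(String idTerm) {
        pendingDeleteTerms.add(idTerm);
    }

    // Applies every queued delete, clears the queue, and
    // returns the number of documents that were removed.
    int flush() {
        int before = liveDocs.size();
        for (String term : pendingDeleteTerms) {
            liveDocs.values().removeIf(term::equals);
        }
        pendingDeleteTerms.clear();
        return before - liveDocs.size();
    }

    int numDocs() {
        return liveDocs.size();
    }

    public static void main(String[] args) {
        BufferedDeletesSketch ri = new BufferedDeletesSketch();
        ri.addDocument("a");
        ri.addDocument("b");
        ri.deleteDocuments("a");
        System.out.println("before flush: " + ri.numDocs());
        System.out.println("removed on flush: " + ri.flush());
        System.out.println("after flush: " + ri.numDocs());
    }
}
```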
  • Jason Rutherglen (JIRA) at Apr 24, 2009 at 7:32 pm
    [ https://issues.apache.org/jira/browse/LUCENE-1313?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12702496#action_12702496 ]

    Jason Rutherglen commented on LUCENE-1313:
    ------------------------------------------

    For this patch I'm debating whether to add a package protected
    IndexWriter.addIndexWriter method. The problem is, the RAMIndex
    blocks on the write to disk during IW.addIndexesNoOptimize which
    if we're using ConcurrentMergeScheduler shouldn't happen?
    Meaning in this proposed solution, if segments keep on piling up
    in RAMIndex, we simply move them over to the disk IW which will
    in the background take care of merging them away and to disk.

    I don't think it's necessary to immediately write ram segments
    to disk (like the current patch does), instead it's possible to
    simply copy segments over from the incoming IW, leave them in
    RAM and they can be merged to disk as necessary? Then on
    IW.flush any segmentinfo(s) that are not from the current
    directory can be flushed to disk?

    Just thinking out loud about this.
  • Michael McCandless (JIRA) at Apr 24, 2009 at 8:04 pm
    [ https://issues.apache.org/jira/browse/LUCENE-1313?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12702515#action_12702515 ]

    Michael McCandless commented on LUCENE-1313:
    --------------------------------------------


    bq. I don't think it's necessary to immediately write ram segments to disk

    I agree: it should be fine from IndexWriter's standpoint if some
    segments live in a private RAMDir and others live in the "real" dir.

    In fact, early versions of LUCENE-843 did exactly this: IW's RAM
    buffer is not as efficient as a written segment, and so you can gain
    some RAM efficiency by flushing first to RAM and then merging to disk.

    I think we could adopt a simple criterion: you flush the new segment to
    the RAM Dir if net RAM used is < maxRamBufferSizeMB. This way no
    further configuration is needed. On auto-flush triggering you then
    must take into account the RAM usage by this RAM Dir. On commit,
    these RAM segments must be migrated to the real dir (preferably by
    forcing a merge, somehow).

    A near realtime reader would also happily mix "real" Dir and RAMDir
    SegmentReaders.

    This should work well I think, and should not require a separate
    RAMIndex class, and won't block things when the RAM segments are
    migrated to disk by CMS.
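
    The flush rule proposed above can be sketched roughly as follows.
    This is only an illustration under stated assumptions:
    FlushTargetSketch, Target and every method name are hypothetical
    (none of them are IndexWriter APIs), and sizes are passed in by
    the caller rather than measured.

```java
// Hypothetical sketch of the flush-target rule; names are not Lucene APIs.
class FlushTargetSketch {
    enum Target { RAM_DIR, REAL_DIR }

    private final double maxRamBufferSizeMB;
    // Space held by segments that were already flushed to the RAM dir
    private double ramDirUsedMB = 0.0;

    FlushTargetSketch(double maxRamBufferSizeMB) {
        this.maxRamBufferSizeMB = maxRamBufferSizeMB;
    }

    // Net RAM = indexing buffer + segments living in the RAM dir.
    double netRamUsedMB(double bufferUsedMB) {
        return bufferUsedMB + ramDirUsedMB;
    }

    // Auto-flush has to count the RAM dir as well as the buffer.
    boolean shouldAutoFlush(double bufferUsedMB) {
        return netRamUsedMB(bufferUsedMB) >= maxRamBufferSizeMB;
    }

    // New segments go to the RAM dir only while net RAM is under the limit.
    Target chooseTarget(double bufferUsedMB) {
        return netRamUsedMB(bufferUsedMB) < maxRamBufferSizeMB
                ? Target.RAM_DIR
                : Target.REAL_DIR;
    }

    // Records a flush that produced a segment of the given size.
    Target flush(double bufferUsedMB, double segmentSizeMB) {
        Target target = chooseTarget(bufferUsedMB);
        if (target == Target.RAM_DIR) {
            ramDirUsedMB += segmentSizeMB;
        }
        return target;
    }

    // On commit the RAM segments are migrated to the real dir.
    void commit() {
        ramDirUsedMB = 0.0;
    }

    public static void main(String[] args) {
        FlushTargetSketch policy = new FlushTargetSketch(16.0);
        System.out.println(policy.flush(2.0, 1.0));
        System.out.println(policy.flush(15.5, 1.0));
    }
}
```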

  • Jason Rutherglen (JIRA) at Apr 24, 2009 at 9:53 pm
    [ https://issues.apache.org/jira/browse/LUCENE-1313?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12702573#action_12702573 ]

    Jason Rutherglen commented on LUCENE-1313:
    ------------------------------------------

    {quote} I think we could adopt a simple criterion: you flush the
    new segment to the RAM Dir if net RAM used is <
    maxRamBufferSizeMB. This way no further configuration is needed.
    On auto-flush triggering you then must take into account the RAM
    usage by this RAM Dir. {quote}

    So we're ok with the blocking that occurs when the ram buffer is
    flushed to the ramdir?

    {quote}On commit, these RAM segments must be migrated to the
    real dir (preferably by forcing a merge, somehow). {quote}

    This is pretty much like resolveExternalSegments which would be
    called in prepareCommit? This could make calls to commit much
    more time consuming. It may be confusing to the user why
    IW.flush doesn't copy the ram segments to disk.

    {quote}A near realtime reader would also happily mix "real" Dir
    and RAMDir SegmentReaders.{quote}

    Agreed, however the IW.getReader MultiSegmentReader removes
    readers from another directory so we'd need to add a new
    attribute to segmentinfo that marks it as ok for inclusion in
    the MSR?


  • Michael McCandless (JIRA) at Apr 24, 2009 at 10:15 pm
    [ https://issues.apache.org/jira/browse/LUCENE-1313?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12702579#action_12702579 ]

    Michael McCandless commented on LUCENE-1313:
    --------------------------------------------

    {quote}
    So we're ok with the blocking that occurs when the ram buffer is
    flushed to the ramdir?
    {quote}

    Well... we don't have a choice (unless/until we implement an IndexReader impl to directly search the RAM buffer). Still, this should be a good improvement over the blocking when flushing to a real dir.

    {quote}
    This is pretty much like resolveExternalSegments which would be
    called in prepareCommit? This could make calls to commit much
    more time consuming. It may be confusing to the user why
    IW.flush doesn't copy the ram segments to disk.
    {quote}

    Similar... the difference is I'd prefer to do a merge of the RAM segments vs the straight one-for-one copy that resolveExternalSegments does.

    commit would only become more time consuming in the NRT case? IE we'd only flush-to-RAMdir if it's getReader that's forcing the flush? In which case, I think it's fine that commit gets more costly. Also, I wouldn't expect it to be much more costly: we are doing an in-memory merge of N segments and writing one segment to the "real" directory, vs writing each tiny segment as a real one. In fact, commit could get cheaper (when compared to not making this change) since there are fewer new files to fsync.

    {quote}
    Agreed, however the IW.getReader MultiSegmentReader removes
    readers from another directory so we'd need to add a new
    attribute to segmentinfo that marks it as ok for inclusion in
    the MSR?
    {quote}

    Or, fix that filtering to also accept IndexWriter's RAMDir.
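
    One way to picture the filtering change suggested above is the
    sketch below. It is an assumption-laden illustration, not the
    IW.getReader code: ReaderFilterSketch, Seg and visibleSegments are
    hypothetical names, and plain objects stand in for Directory
    instances.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch; Seg is a stand-in for SegmentInfo, not a Lucene class.
class ReaderFilterSketch {
    static final class Seg {
        final String name;
        final Object dir; // the directory the segment lives in
        Seg(String name, Object dir) {
            this.name = name;
            this.dir = dir;
        }
    }

    // Keeps segments that live in the writer's real directory or in its
    // private RAM directory; segments from any other directory are dropped.
    static List<String> visibleSegments(List<Seg> all, Object realDir, Object writerRamDir) {
        List<String> visible = new ArrayList<>();
        for (Seg s : all) {
            if (s.dir == realDir || s.dir == writerRamDir) {
                visible.add(s.name);
            }
        }
        return visible;
    }

    public static void main(String[] args) {
        Object realDir = new Object();
        Object ramDir = new Object();
        Object foreignDir = new Object();
        List<Seg> segs = new ArrayList<>();
        segs.add(new Seg("_0", realDir));
        segs.add(new Seg("_1", ramDir));
        segs.add(new Seg("_x", foreignDir));
        System.out.println(visibleSegments(segs, realDir, ramDir));
    }
}
```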
  • Jason Rutherglen (JIRA) at Apr 24, 2009 at 10:49 pm
    [ https://issues.apache.org/jira/browse/LUCENE-1313?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12702596#action_12702596 ]

    Jason Rutherglen commented on LUCENE-1313:
    ------------------------------------------

    I'm confused as to how we make DocumentsWriter switch from
    writing to disk vs the ramdir? It seems like a fairly major
    change to the system? One that's hard to switch later on after
    IW is instantiated? Perhaps the IW.addWriter method is easier in
    this regard?

    {quote} the difference is I'd prefer to do a merge of the RAM
    segments vs the straight one-for-one copy that
    resolveExternalSegments does.{quote}

    Yeah I implemented it this way in the IW.addWriter code. I agree
    it's better for IW.commit to copy all the ramdir segments to one
    disk segment.

    I started working on the IW.addWriter(IndexWriter, boolean
    removeFrom) where removeFrom removes the segments that have been
    copied to the destination writer from the source writer. This
    method gets around the issue of blocking because potentially
    several writers could concurrently be copied to the destination
    writer. The only issue at this point is how the destination
    writer obtains segmentreaders from source readers when they're
    in the other writers' pool? Maybe the SegmentInfo can have a
    reference to the writer it originated in? That way we can easily
    access the right reader pool when we need it?
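
    To make the removeFrom semantics concrete, here is a minimal
    sketch in plain Java. Writer and addWriter below are hypothetical
    stand-ins for the proposed IW.addWriter, segments are just names,
    and the open question about reader pools is not addressed at all.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of the proposed addWriter; not IndexWriter code.
class AddWriterSketch {
    static final class Writer {
        final List<String> segments = new ArrayList<>();

        // Copies the source writer's segments into this writer. When
        // removeFrom is true, the copied segments are dropped from the source.
        void addWriter(Writer source, boolean removeFrom) {
            List<String> copied;
            // Snapshot under the source lock only, so this writer is not
            // locked while the source is being read.
            synchronized (source) {
                copied = new ArrayList<>(source.segments);
                if (removeFrom) {
                    source.segments.removeAll(copied);
                }
            }
            synchronized (this) {
                segments.addAll(copied);
            }
        }
    }

    public static void main(String[] args) {
        Writer ram = new Writer();
        ram.segments.add("_r0");
        ram.segments.add("_r1");
        Writer disk = new Writer();
        disk.segments.add("_0");
        disk.addWriter(ram, true);
        System.out.println("disk: " + disk.segments + ", ram: " + ram.segments);
    }
}
```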


  • Michael McCandless (JIRA) at Apr 25, 2009 at 10:04 am
    [ https://issues.apache.org/jira/browse/LUCENE-1313?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12702673#action_12702673 ]

    Michael McCandless commented on LUCENE-1313:
    --------------------------------------------

    {quote}
    I'm confused as to how we make DocumentsWriter switch from
    writing to disk vs the ramdir? It seems like a fairly major
    change to the system? One that's hard to switch later on after
    IW is instantiated? Perhaps the IW.addWriter method is easier in
    this regard?
    {quote}

    When we create SegmentWriteState (which is supposed to contain all
    details needed to tell DW how/where to write the segment), we'd set
    its directory to the RAMDir? That ought to be all that's needed
    (though, it's possible some places use a private copy of the original
    directory, which we should fix). DW shouldn't care which Directory
    the segment is written to...

    {quote}
    the difference is I'd prefer to do a merge of the RAM
    segments vs the straight one-for-one copy that
    resolveExternalSegments does.
    Yeah I implemented it this way in the IW.addWriter code. I agree
    it's better for IW.commit to copy all the ramdir segments to one
    disk segment.
    {quote}

    OK. Maybe we modify resolveExternalSegments to accept a "doMerge"?

    {quote}
    I started working on the IW.addWriter(IndexWriter, boolean
    removeFrom) where removeFrom removes the segments that have been
    copied to the destination writer from the source writer. This
    method gets around the issue of blocking because potentially
    several writers could concurrently be copied to the destination
    writer. The only issue at this point is how the destination
    writer obtains segmentreaders from source readers when they're
    in the other writers' pool? Maybe the SegmentInfo can have a
    reference to the writer it originated in? That way we can easily
    access the right reader pool when we need it?
    {quote}

    I don't think we need two writers? I think one writer, sometimes
    flushing to RAMDir, is a clean solution?
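
    The difference a "doMerge" flag would make can be sketched as
    below. This is an illustration only: the method shares a name with
    resolveExternalSegments but is a hypothetical stand-in, Seg is not
    SegmentInfo, and the mergedName parameter is an assumption added
    so the example stays deterministic.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch; models segments as (name, location, doc count).
class ResolveExternalSketch {
    static final class Seg {
        final String name;
        final boolean inRam;
        final int docCount;
        Seg(String name, boolean inRam, int docCount) {
            this.name = name;
            this.inRam = inRam;
            this.docCount = docCount;
        }
    }

    // Returns the segment list after all RAM segments have been moved to the
    // real directory, either one-for-one or merged into a single segment.
    static List<Seg> resolveExternalSegments(List<Seg> segs, boolean doMerge, String mergedName) {
        List<Seg> result = new ArrayList<>();
        int mergedDocs = 0;
        boolean sawRamSegment = false;
        for (Seg s : segs) {
            if (!s.inRam) {
                result.add(s); // already in the real directory
            } else if (doMerge) {
                mergedDocs += s.docCount; // fold into the single merged segment
                sawRamSegment = true;
            } else {
                result.add(new Seg(s.name, false, s.docCount)); // straight copy
            }
        }
        if (sawRamSegment) {
            result.add(new Seg(mergedName, false, mergedDocs));
        }
        return result;
    }

    public static void main(String[] args) {
        List<Seg> segs = new ArrayList<>();
        segs.add(new Seg("_0", false, 100));
        segs.add(new Seg("_1", true, 3));
        segs.add(new Seg("_2", true, 2));
        System.out.println(resolveExternalSegments(segs, false, "_3").size());
        System.out.println(resolveExternalSegments(segs, true, "_3").size());
    }
}
```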

  • Michael McCandless (JIRA) at Apr 25, 2009 at 11:39 am
    [ https://issues.apache.org/jira/browse/LUCENE-1313?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12702683#action_12702683 ]

    Michael McCandless commented on LUCENE-1313:
    --------------------------------------------

    One separate optimization we should make with NRT is to not close the doc store (stored fields, term vector) files when flushing for an NRT reader.

    We do close them now, which then makes merging quite a bit more costly.

    The trickiness with this optimization is we'd need to be able to somehow share an IndexInput & IndexOutput; or, perhaps we can open an IndexInput even though an IndexOutput has the same file open (Windows may prevent this, though I think I've seen that it will in fact allow it).

    Once we do that optimization, then with this RAMDir optimization we should try to have the doc store files punch straight through to the real directory, ie bypass the RAMDir. The doc stores are space consuming, and since with autoCommit=false we can bypass merging them, it makes no sense to store them in the RAMDir.

    We should probably do this optimization as a "phase 2", after this one.
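
    A rough sketch of the "punch straight through" idea: route each
    file by its extension, sending doc store files to the real
    directory and everything else to the RAMDir. The class and method
    names are hypothetical. The extension list assumes the usual
    stored fields files (.fdt, .fdx) and term vector files (.tvx,
    .tvd, .tvf).

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Hypothetical sketch of routing files by extension; not Lucene code.
class DocStoreRoutingSketch {
    enum Target { RAM_DIR, REAL_DIR }

    // Stored fields (.fdt, .fdx) and term vectors (.tvx, .tvd, .tvf)
    static final Set<String> DOC_STORE_EXTENSIONS =
            new HashSet<>(Arrays.asList("fdt", "fdx", "tvx", "tvd", "tvf"));

    // Doc store files bypass the RAM dir; all other files stay in RAM.
    static Target targetFor(String fileName) {
        int dot = fileName.lastIndexOf('.');
        String extension = dot < 0 ? "" : fileName.substring(dot + 1);
        return DOC_STORE_EXTENSIONS.contains(extension)
                ? Target.REAL_DIR
                : Target.RAM_DIR;
    }

    public static void main(String[] args) {
        for (String name : new String[] {"_0.fdt", "_0.tvx", "_0.tis", "_0.frq"}) {
            System.out.println(name + " -> " + targetFor(name));
        }
    }
}
```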
  • Jason Rutherglen (JIRA) at Apr 27, 2009 at 7:37 pm
    [ https://issues.apache.org/jira/browse/LUCENE-1313?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12703315#action_12703315 ]

    Jason Rutherglen commented on LUCENE-1313:
    ------------------------------------------

    {quote} When we create SegmentWriteState (which is supposed to
    contain all details needed to tell DW how/where to write the
    segment), we'd set its directory to the RAMDir? That ought to be
    all that's needed (though, it's possible some places use a
    private copy of the original directory, which we should fix). DW
    shouldn't care which Directory the segment is written to...
    {quote}

    Agreed that DW can write the segment to the RAMDir. I started
    coding along these lines; however, what do we do about the RAMDir
    merging? This is why I was thinking we'll need a separate IW.
    Otherwise the ram segments (if they are treated the same as disk
    segments) would quickly be merged to disk? Or we have two
    separate merging paths?

    If we have a disk IW and ram IW, I'm not sure how the docstores
    to disk part would work though I'm sure there's some way to do
    it.

    bq. modify resolveExternalSegments to accept a "doMerge"?

    Sounds good.
  • Jason Rutherglen (JIRA) at Apr 27, 2009 at 7:51 pm
    [ https://issues.apache.org/jira/browse/LUCENE-1313?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12703317#action_12703317 ]

    Jason Rutherglen commented on LUCENE-1313:
    ------------------------------------------

    {quote}we should make with NRT is to not close the doc store
    (stored fields, term vector) files when flushing for an NRT
    reader. {quote}

    Agreed, I think this feature is a must; otherwise we're doing
    unnecessary in-RAM merging.

    {quote}we'd need to be able to somehow share an IndexInput &
    IndexOutput; or, perhaps we can open an IndexInput even though
    an IndexOutput{quote}

    I ran into problems with this before, when I was trying to reuse
    Directory to write a transaction log. It seemed theoretically
    doable; however, it didn't work in practice. It could have been
    the seeking and replacing, but I don't remember. FSIndexOutput
    uses a writeable RAF and FSIndexInput is read only, so why would
    there be an issue?
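
    A small experiment along these lines can be written with plain
    java.io, independent of Lucene: write through a read-write
    RandomAccessFile and read the same bytes back through a second,
    read-only handle while the writer is still open. SharedFileSketch
    is a hypothetical name, and the sketch says nothing about
    Lucene's own buffering (a buffered output would have to flush
    before a separate reader could see its bytes).

```java
import java.io.File;
import java.io.IOException;
import java.io.RandomAccessFile;
import java.io.UncheckedIOException;
import java.util.Arrays;

// Hypothetical experiment; uses only java.io, not Lucene's Directory API.
class SharedFileSketch {
    // Writes data through a read-write handle, then reads it back through a
    // separate read-only handle while the writing handle is still open.
    static byte[] writeThenReadWhileOpen(byte[] data) {
        try {
            File file = File.createTempFile("shared", ".bin");
            try (RandomAccessFile out = new RandomAccessFile(file, "rw");
                 RandomAccessFile in = new RandomAccessFile(file, "r")) {
                out.write(data);
                byte[] buffer = new byte[data.length];
                in.readFully(buffer); // throws EOFException if bytes are missing
                return buffer;
            } finally {
                file.delete(); // both handles are closed by this point
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        byte[] data = {1, 2, 3, 4};
        byte[] readBack = writeThenReadWhileOpen(data);
        System.out.println(Arrays.equals(data, readBack));
    }
}
```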



  • Jason Rutherglen (JIRA) at Apr 27, 2009 at 7:59 pm
    [ https://issues.apache.org/jira/browse/LUCENE-1313?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12703327#action_12703327 ]

    Jason Rutherglen commented on LUCENE-1313:
    ------------------------------------------

    {quote}doc store files punch straight through to the real
    directory{quote}

    To implement this functionality in parallel (and perhaps make
    the overall patch cleaner), writing doc stores directly to a
    separate directory can be a different patch? There can be an
    option IW.setDocStoresDirectory(Directory) that the patch
    implements? Then some unit tests that are separate from the near
    realtime portion.
  • Michael McCandless (JIRA) at Apr 27, 2009 at 8:58 pm
    [ https://issues.apache.org/jira/browse/LUCENE-1313?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12703366#action_12703366 ]

    Michael McCandless commented on LUCENE-1313:
    --------------------------------------------

    {quote}
    Agreed that DW can write the segment to the RAMDir. I started
    coding along these lines however what do we do about the RAMDir
    merging? This is why I was thinking we'll need a separate IW?
    Otherwise the ram segments (if they are treated the same as disk
    segments) would quickly be merged to disk? Or we have two
    separate merging paths?
    {quote}

    Hmm, right. We could exclude RAMDir segments from consideration by
    MergePolicy? Alternatively, we could "expect" the MergePolicy to
    recognize this and be smart about choosing merges (ie don't mix
    merges)?

    EG we do in fact want some merging of the RAM segments if they get too
    numerous (since that will impact search performance).
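
    The "don't mix merges" behaviour could look roughly like the
    sketch below: only RAM segments are candidates, and a merge is
    proposed only once there are too many of them. All names are
    hypothetical and the threshold parameter is an assumption; this
    is not a MergePolicy implementation.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of a RAM-aware merge selection; not a Lucene MergePolicy.
class RamAwareMergeSketch {
    static final class Seg {
        final String name;
        final boolean inRam;
        Seg(String name, boolean inRam) {
            this.name = name;
            this.inRam = inRam;
        }
    }

    // Returns the RAM segments to merge together, or an empty list when their
    // count does not exceed maxRamSegments. Disk segments are never included,
    // so a RAM merge is never mixed with a disk merge.
    static List<String> pickRamMerge(List<Seg> segs, int maxRamSegments) {
        List<String> ramSegments = new ArrayList<>();
        for (Seg s : segs) {
            if (s.inRam) {
                ramSegments.add(s.name);
            }
        }
        return ramSegments.size() > maxRamSegments ? ramSegments : new ArrayList<>();
    }

    public static void main(String[] args) {
        List<Seg> segs = new ArrayList<>();
        segs.add(new Seg("_0", false));
        segs.add(new Seg("_1", true));
        segs.add(new Seg("_2", true));
        segs.add(new Seg("_3", true));
        System.out.println(pickRamMerge(segs, 2));
        System.out.println(pickRamMerge(segs, 5));
    }
}
```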

    {quote}
    we should make with NRT is to not close the doc store
    (stored fields, term vector) files when flushing for an NRT
    reader.
    Agreed, I think this feature is a must otherwise we're doing
    unnecessary in ram merging.
    {quote}

    OK, let's do this as a separate issue/optimization for NRT. There are
    two separate parts to it:

    * Ability to store doc stores in "real" directory (looks like you
    opened LUCENE-1618 for this part).

    * Ability to "share" IndexOutput & IndexInput

    {quote}
    I ran into problems with this before, I was trying to reuse
    Directory to write a transaction log. It seemed theoretically
    doable however it didn't work in practice. It could have been
    the seeking and replacing but I don't remember. FSIndexOutput
    uses a writeable RAF and FSIndexInput is read only why would
    there be an issue?
    {quote}

    Hmm... seems like we need to investigate further. We could either
    "ask" an IndexOutput for its IndexInput (sharing the underlying RAF),
    or try to separately open an IndexInput (which may not work on
    Windows).

    {quote}
    To implement this functionality in parallel (and perhaps make
    the overall patch cleaner), writing doc stores directly to a
    separate directory can be a different patch? There can be an
    option IW.setDocStoresDirectory(Directory) that the patch
    implements? Then some unit tests that are separate from the near
    realtime portion.
    {quote}

    Yes, separate issue (LUCENE-1618).

  • Michael McCandless (JIRA) at Apr 27, 2009 at 9:10 pm
    [ https://issues.apache.org/jira/browse/LUCENE-1313?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12703377#action_12703377 ]

    Michael McCandless commented on LUCENE-1313:
    --------------------------------------------

    So let's leave this issue focused on sometimes using RAMDir for newly created segments.
  • Jason Rutherglen (JIRA) at Apr 27, 2009 at 9:20 pm
    [ https://issues.apache.org/jira/browse/LUCENE-1313?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12703387#action_12703387 ]

    Jason Rutherglen commented on LUCENE-1313:
    ------------------------------------------

    {quote} We could exclude RAMDir segments from consideration by
    MergePolicy? Alternatively, we could "expect" the MergePolicy to
    recognize this and be smart about choosing merges (ie don't mix
    merges)? {quote}

    Is this over complicating things? Sometimes we want a mixture of
    RAMDir segments and FSDir segments to merge (when we've decided
    we have too much in ram), sometimes we don't (when we just want
    the ram segments to merge). I'm still a little confused as to
    why having a wrapper class that manages a disk writer and a ram
    writer isn't cleaner?
  • Michael McCandless (JIRA) at Apr 27, 2009 at 10:17 pm
    [ https://issues.apache.org/jira/browse/LUCENE-1313?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12703419#action_12703419 ]

    Michael McCandless commented on LUCENE-1313:
    --------------------------------------------

    {quote}
    Sometimes we want a mixture of
    RAMDir segments and FSDir segments to merge (when we've decided
    we have too much in ram),
    {quote}

    I don't think we want to mix RAM & disk merging?

    EG when RAM is full, we want to quickly flush it to disk as a single
    segment. Merging with disk segments only makes that flush slower?

    {quote}
    I'm still a little confused as to
    why having a wrapper class that manages a disk writer and a ram
    writer isn't cleaner?
    {quote}

    This is functionally the same as not mixing RAM vs disk merging,
    right (ie just as "clean")?

  • Michael McCandless (JIRA) at Apr 28, 2009 at 3:47 pm
    [ https://issues.apache.org/jira/browse/LUCENE-1313?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12703686#action_12703686 ]

    Michael McCandless commented on LUCENE-1313:
    --------------------------------------------

    Yonik raised a good question on LUCENE-1618, which is what gains do we really expect to see by using RAMDir for the tiny recently flushed segments?

    It would be nice if we could approximately measure this before putting more work into this issue -- if the gains are not "decent" this optimization may not be worthwhile.

    Of course, we are talking about 100s of milliseconds for the turnaround time to add docs & open an NRT reader, so if the time for writing/opening many tiny files in RAMDir vs FSDir differs by say 10s of msecs then we should pursue this. We should also consider that the IO system may very well be quite busy (doing merge(s), backups, etc.) and that'd make it slower to have to create tiny files.

    A simpler optimization might be to allow using CFS for tiny files (even when CFS is turned off), but build the CFS in RAM (ie, write tiny files first to RAMFiles, then make the CFS file on disk). That might get most of the gains since the FSDir sees only one file created per tiny segment, not N.
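
    To make the idea concrete, here is a minimal sketch in plain Java
    of buffering the tiny files in RAM and then writing one combined
    file to disk. RamCompoundSketch and its simple count/name/length
    layout are assumptions for illustration only; this is not
    Lucene's actual CFS format or API.

```java
import java.io.BufferedInputStream;
import java.io.BufferedOutputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch: collect tiny files in memory, then emit one file on disk.
final class RamCompoundSketch {
    // In-memory "tiny files" keyed by name, in creation order.
    private final Map<String, ByteArrayOutputStream> ramFiles = new LinkedHashMap<>();

    // Returns a stream that buffers the named file entirely in RAM.
    OutputStream createOutput(String name) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        ramFiles.put(name, out);
        return out;
    }

    // Writes all buffered files into a single on-disk file: count, then (name, length, bytes) per entry.
    void writeCompound(Path target) throws IOException {
        try (DataOutputStream out = new DataOutputStream(
                new BufferedOutputStream(Files.newOutputStream(target)))) {
            out.writeInt(ramFiles.size());
            for (Map.Entry<String, ByteArrayOutputStream> e : ramFiles.entrySet()) {
                byte[] data = e.getValue().toByteArray();
                out.writeUTF(e.getKey());
                out.writeInt(data.length);
                out.write(data);
            }
        }
    }

    // Reads the combined file back into name -> bytes, mainly to check the round trip.
    static Map<String, byte[]> readCompound(Path source) throws IOException {
        Map<String, byte[]> files = new LinkedHashMap<>();
        try (DataInputStream in = new DataInputStream(
                new BufferedInputStream(Files.newInputStream(source)))) {
            int count = in.readInt();
            for (int i = 0; i < count; i++) {
                String name = in.readUTF();
                byte[] data = new byte[in.readInt()];
                in.readFully(data);
                files.put(name, data);
            }
        }
        return files;
    }
}
```

    The file system then sees a single create/write/close per tiny
    segment, which is the point of the optimization.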
  • Yonik Seeley (JIRA) at Apr 28, 2009 at 4:03 pm
    [ https://issues.apache.org/jira/browse/LUCENE-1313?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12703695#action_12703695 ]

    Yonik Seeley commented on LUCENE-1313:
    --------------------------------------

    bq. Yonik raised a good question on LUCENE-1618, which is what gains do we really expect to see by using RAMDir for the tiny recently flushed segments?

    I raised it more because of the direction the discussion was veering (write through caching to a RAMDirectory, and RAMDirectory being faster to *search*). I do believe that RAMDirectory can probably improve NRT, but it would be due to avoiding waiting for file open/write/close/open/read (as Mike also said)... and not any difference during IndexSearcher.search(), which should be irrelevant due to the relative size differences of the RAMDirectory and the FSDirectory. Small file creation speeds will also be heavily dependent on the exact file system used.

  • Jason Rutherglen (JIRA) at Apr 28, 2009 at 10:07 pm
    [ https://issues.apache.org/jira/browse/LUCENE-1313?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12703853#action_12703853 ]

    Jason Rutherglen commented on LUCENE-1313:
    ------------------------------------------

    {quote}EG when RAM is full, we want to quickly flush it to disk
    as a single segment. Merging with disk segments only makes that
    flush slower?{quote}

    I assume it's ok for the IW.mergescheduler to be used which may
    not immediately perform the merge to disk (in the case of
    ConcurrentMergeScheduler)? When implementing using
    addIndexesNoOptimize (which blocks) I realized we probably don't
    want blocking to occur because that means shutting down the
    updates.

    Also a random thought, it seems like ConcurrentMergeScheduler
    works great for RAMDir merging, how does it compare with
    SerialMS on an FSDirectory? It seems like it shouldn't be too much
    faster given the IO sequential access bottleneck?
  • Michael McCandless (JIRA) at Apr 29, 2009 at 10:04 am
    [ https://issues.apache.org/jira/browse/LUCENE-1313?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12704051#action_12704051 ]

    Michael McCandless commented on LUCENE-1313:
    --------------------------------------------

    {quote}
    I assume it's ok for the IW.mergescheduler to be used which may
    not immediately perform the merge to disk (in the case of
    ConcurrentMergeScheduler)?
    {quote}

    Only if we "accept" requiring MergePolicy to be aware that some
    segments are in RAMDir and some are in the "real" Dir and to "act
    accordingly", ie 1) don't mix the dirs when merging, 2) when RAM is
    "full" merge every single RAM segment into a single "real Dir" segment
    (requires IW to provide exposure on how much RAM DW's buffer is
    currently consuming), 3) properly "maintain" the RAM segments (ie,
    merge RAM -> RAM somehow) so that searchers don't search too many RAM
    segments.

    I think this approach is probably best: you're right that allowing CMS
    to manage these RAM segments is nice since it'll happen in the BG and
    will not block updates.

    It does mean, though, that the RAM usage semantics of IW is no longer
    so "crisp" as flushing today ("once RAM is full, stop world & flush it
    to disk, then resume") but I think that's acceptable and perhaps
    preferable since world is no longer stopped to flush RAM -> disk.

    Though one trickiness is... if a large RAM -> RAM merge takes place,
    we temporarily double the RAM consumption. I think MergePolicy simply
    shouldn't do that. Ie at no point should it be merging a very large
    percentage of the RAM segments. It should instead merge RAM -> disk.
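
    A rough sketch of how those rules might combine, in plain Java.
    Everything here (RamMergeSketch, Seg, Decision, the thresholds) is
    hypothetical and not an existing Lucene API, and it ignores the
    contiguity requirement IndexWriter places on merges. It only
    shows the decision order: never mix dirs, flush all RAM segments
    when RAM is full, and fall back to RAM -> disk rather than doing
    a large RAM -> RAM merge.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical sketch of a RAM-aware merge decision; not a real MergePolicy.
final class RamMergeSketch {
    // Stand-in for a segment: name, size in bytes, and whether it lives in the RAM dir.
    static final class Seg {
        final String name;
        final long bytes;
        final boolean inRam;
        Seg(String name, long bytes, boolean inRam) {
            this.name = name;
            this.bytes = bytes;
            this.inRam = inRam;
        }
    }

    enum Action { NONE, MERGE_RAM_TO_RAM, FLUSH_RAM_TO_DISK }

    static final class Decision {
        final Action action;
        final List<Seg> segments;
        Decision(Action action, List<Seg> segments) {
            this.action = action;
            this.segments = segments;
        }
    }

    static Decision choose(List<Seg> all, long bufferBytes, long maxRamBytes,
                           int maxRamSegments, double maxMergeFraction) {
        // Rule 1: only RAM segments are ever considered here, so dirs are never mixed.
        List<Seg> ram = new ArrayList<>();
        long ramBytes = 0;
        for (Seg s : all) {
            if (s.inRam) {
                ram.add(s);
                ramBytes += s.bytes;
            }
        }
        if (ram.isEmpty()) {
            return new Decision(Action.NONE, ram);
        }
        // Rule 2: RAM segments plus the indexing buffer exceed the budget, so flush everything.
        if (ramBytes + bufferBytes > maxRamBytes) {
            return new Decision(Action.FLUSH_RAM_TO_DISK, ram);
        }
        // Rule 3: keep the number of RAM segments that searchers must visit bounded.
        if (ram.size() <= maxRamSegments) {
            return new Decision(Action.NONE, new ArrayList<>());
        }
        List<Seg> bySize = new ArrayList<>(ram);
        bySize.sort(Comparator.comparingLong((Seg s) -> s.bytes));
        int take = Math.min(ram.size(), ram.size() - maxRamSegments + 1);
        List<Seg> candidate = new ArrayList<>(bySize.subList(0, take));
        long candidateBytes = 0;
        for (Seg s : candidate) {
            candidateBytes += s.bytes;
        }
        // A RAM -> RAM merge temporarily doubles the RAM of its inputs; if it would cover
        // too large a share of the RAM segments, go to disk instead.
        if (candidateBytes > maxMergeFraction * ramBytes) {
            return new Decision(Action.FLUSH_RAM_TO_DISK, ram);
        }
        return new Decision(Action.MERGE_RAM_TO_RAM, candidate);
    }
}
```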

    This'd also mean advanced users that implement their own MergePolicy
    must realize when IW is used with NRT reader that additional smarts is
    recommended wrt

    {quote}
    When implementing using
    addIndexesNoOptimize (which blocks) I realized we probably don't
    want blocking to occur because that means shutting down the
    updates.
    {quote}
    Right, this is one of the strong reasons to do the "internal" approach
    vs "external" one.

    {quote}
    Also a random thought, it seems like ConcurrentMergeScheduler
    works great for RAMDir merging, how does it compare with
    SerialMS on an FSDirectory? It seems like it shouldn't be too much
    faster given the IO sequential access bottleneck?
    {quote}

    By far the biggest win of CMS over SMS is in the first merge, because
    it does not block the further addition of docs. Thus an app can
    continue indexing into RAM buffer (consuming CPU & RAM resources)
    while a BG thread consumes RAM + IO resources. This is very much a
    win.

    Beyond the first merge...in theory, modern IO systems have concurrency
    (eg the NCQ in a single SATA drive) so you should "gain" by having
    several threads performing IO at once. The OS & hard drives attempt
    to re-order the request in a more optimal way (like an elevator,
    sweeping floors). I haven't explicitly tested this with Lucene...

    I believe SSDs handle concurrent requests very well since under the
    hood most of them are multi-channel basically RAID0 devices (eg Intel
    X25M has 10 channels).

  • Jason Rutherglen (JIRA) at Apr 29, 2009 at 9:07 pm
    [ https://issues.apache.org/jira/browse/LUCENE-1313?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12704340#action_12704340 ]

    Jason Rutherglen commented on LUCENE-1313:
    ------------------------------------------

    Regarding the MergePolicy, what if a particular merge policy is
    more optimal for the ram segments vs. the real dir segments? Perhaps
    the best way to make this clean is to keep the ram merge policy
    and real dir merge policies different? That way merge
    policy implementations don't need to worry about ram and non-ram
    dir cases.

    Perhaps an IW.updatePendingRamMerges method should be added that
    handles this separately? Does the ram dir ever need to worry
    about things like maxNumSegmentsOptimize and optimize?

    {quote}Right, this is one of the strong reasons to do the
    "internal" approach vs "external" one.{quote}

    I think having the ram merge policy should cover the reasons I
    had for having a separate ram writer. The IW.addWriter
    method I implemented would not have blocked, but I don't think
    it's necessary now if we have a separate ram merge policy.
  • Jason Rutherglen (JIRA) at Apr 29, 2009 at 9:25 pm
    [ https://issues.apache.org/jira/browse/LUCENE-1313?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12704340#action_12704340 ]

    Jason Rutherglen edited comment on LUCENE-1313 at 4/29/09 2:24 PM:
    -------------------------------------------------------------------

    A merge policy may be more optimal for ram segments vs disk
    segments. Perhaps the best way to make this clean is to keep the
    ram merge policy and real dir merge policies different? That way
    merge policy implementations don't need to worry about
    ram and non-ram dir cases.

    Perhaps an IW.updatePendingRamMerges method should be added that
    handles this separately? Does the ram dir ever need to worry
    about things like maxNumSegmentsOptimize and optimize?

    {quote}Right, this is one of the strong reasons to do the
    "internal" approach vs "external" one.{quote}

    I think having the ram merge policy should cover the reasons I
    had for having a separate ram writer. The IW.addWriter
    method I implemented would not have blocked, but I don't think
    it's necessary now if we have a separate ram merge policy.

  • Michael McCandless (JIRA) at Apr 30, 2009 at 1:19 pm
    [ https://issues.apache.org/jira/browse/LUCENE-1313?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12704628#action_12704628 ]

    Michael McCandless commented on LUCENE-1313:
    --------------------------------------------

    {quote}
    Perhaps the best way to make this clean is to keep the
    ram merge policy and real dir merge policies different? That way
    merge policy implementations don't need to worry about
    ram and non-ram dir cases.
    {quote}

    OK tentatively this feels like a good approach. Would you re-use
    MergePolicy, or make a new RAMMergePolicy?

    Would we use the same MergeScheduler to then execute the selected
    merges?

    How would we handle the "it's time to flush some RAM to disk" case?
    Would RAMMergePolicy make that decision?

    bq. Perhaps an IW.updatePendingRamMerges method should be added that handles this separately?

    Yes?

    bq. Does the ram dir ever need to worry about things like maxNumSegmentsOptimize and optimize?

    No?

    {quote}
    I think having the ram merge policy should cover the reasons I
    had for having a separate ram writer. The IW.addWriter
    method I implemented would not have blocked, but I don't think
    it's necessary now if we have a separate ram merge policy.
    {quote}

    OK good.

  • Jason Rutherglen (JIRA) at Apr 30, 2009 at 8:27 pm
    [ https://issues.apache.org/jira/browse/LUCENE-1313?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

    Jason Rutherglen updated LUCENE-1313:
    -------------------------------------

    Attachment: LUCENE-1313.patch

    {quote} Would you re-use MergePolicy, or make a new
    RAMMergePolicy? {quote}

    MergePolicy is used as is, with a special IW method that handles
    merging ram segments to the real directory. This has an issue
    around merging contiguous segments; can that requirement be
    relaxed in this case? I don't understand why it exists.

    The patch is not committable, however I am posting it to show a
    path that seems to work. It includes test cases for merging in
    ram and merging to the real directory.

    * IW.getFlushDirectory is used by internal calls to obtain the
    directory to flush segments to. This is used in DocumentsWriter
    related calls.

    * DocumentsWriter.directory is removed so that methods requiring
    the directory call IW.getFlushDirectory instead.

    * IW.setRAMDirectory sets the ram directory to be used.

    * IW.setRAMMergePolicy sets the merge policy to be used for
    merging segments on the ram dir.

    * In IW.updatePendingMerges totalRamUsed is the size of the ram
    segments + the ram buffer used. If totalRamUsed exceeds the max
    ram buffer size then IW.updatePendingRamMergesToRealDir is
    called.

    * IW.updatePendingRamMergesToRealDir registers a merge of the
    ram segments to the real directory (currently causes a
    non-contiguous segments exception)

    * MergePolicy.OneMerge has a directory attribute used when
    building the merge.info in _mergeInit.

    * Test case includes testMergeInRam, testMergeToDisk,
    testMergeRamExceeded

    There is one error that occurs regularly in testMergeRamExceeded
    {code} MergePolicy selected non-contiguous segments to merge
    (_bo:cx83 _bm:cx4 _bn:cx2 _bl:cx1->_bj _bp:cx1->_bp _bq:cx1->_bp
    _c2:cx1->_c2 _c3:cx1->_c2 _c4:cx1->_c2 vs _5x:c120 _6a:c8
    _6t:c11 _bo:cx83** _bm:cx4** _bn:cx2** _bl:cx1->_bj**
    _bp:cx1->_bp** _bq:cx1->_bp** _c1:c10 _c2:cx1->_c2**
    _c3:cx1->_c2** _c4:cx1->_c2**), which IndexWriter (currently)
    cannot handle {code}
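
    In the message above, the selected segments are the ones marked
    with ** in the second list, and _c1:c10 sits between _bq and _c2,
    which breaks the run. A minimal sketch of that kind of check in
    plain Java follows; ContiguitySketch and isContiguous are
    hypothetical names, not the actual IndexWriter code, and the
    check is simply "the merge list must appear as one unbroken run
    in the index's segment list".

```java
import java.util.Collections;
import java.util.List;

// Hypothetical sketch of a contiguity check over segment names.
final class ContiguitySketch {
    // True when mergeSegments occurs, in order and without gaps, inside indexSegments.
    static boolean isContiguous(List<String> indexSegments, List<String> mergeSegments) {
        return !mergeSegments.isEmpty()
            && Collections.indexOfSubList(indexSegments, mergeSegments) >= 0;
    }
}
```

    With the names from the exception, the full selection fails this
    check while _bo through _bq alone would pass.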
  • Jason Rutherglen (JIRA) at Apr 30, 2009 at 10:00 pm
    [ https://issues.apache.org/jira/browse/LUCENE-1313?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

    Jason Rutherglen updated LUCENE-1313:
    -------------------------------------

    Attachment: LUCENE-1313.patch

    * Ok, fixed the ensureContiguousMerge exception by asking the
    mergePolicy (not ramMergePolicy) to evaluate the ram segment
    infos as an optimize to directory. Now all the current tests
    pass.

    * The patch is cleaned up a little, needs more, and further test
    cases.

    * IndexWriter doesn't accept setRAMDirectory anymore; the ram dir
    needs to be passed into the IndexWriter constructor. This is
    because the ram dir must not be changed in the middle of an
    operation while the system is running.
  • Jason Rutherglen (JIRA) at Apr 30, 2009 at 11:21 pm
    [ https://issues.apache.org/jira/browse/LUCENE-1313?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

    Jason Rutherglen updated LUCENE-1313:
    -------------------------------------

    Attachment: LUCENE-1313.patch

    Fixed and cleaned up more.

    All tests pass.

    Added an entry in CHANGES.txt.

    I'm going to integrate LUCENE-1618 and test that out as a part of the next patch.

Discussion Overview
group: java-dev
categories: lucene
posted: Apr 1, '09 at 9:07p
active: Apr 30, '09 at 11:21p
posts: 36
users: 1
website: lucene.apache.org

1 user in discussion

Jason Rutherglen (JIRA): 36 posts
