[ https://issues.apache.org/jira/browse/LUCENE-1313?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12704995#action_12704995 ]

Michael McCandless commented on LUCENE-1313:
--------------------------------------------

Patch looks good! Some comments:

* I don't think the caller should provide the RAMDir... and we
should have a getter to get it. I think this should be
under-the-hood. This is simply a nice way for IW to use RAM as
buffer in the presence of frequent NRT readers being opened.

* If NRT is never used, the behavior of IW should be unchanged
(which is not the case w/ this patch I think). RAMDir should be
created the first time a flush is done due to NRT creation.

* StoredFieldsWriter & TermVectorsTermsWriter now write to
IndexWriter.getFlushDirectory(), which is confusing because that
method returns the RAMDir if set? Shouldn't this be the opposite?
(Ie it should flush to IndexWriter.getDirectory()? Or we should
change getFlushDirectory to NOT return the ramdir?)

* Why did you need to add synchronized to some of the SegmentInfo
files methods? (What breaks if you undo that?). The contract
here is IW protects access to SegmentInfo/s.

* The MergePolicy needs some smarts when it's dealing w/ RAM. EG it
should not do a merge of more than XXX% of total RAM usage (should
flush to the real directory instead).

* Nothing is calling the new ramOverLimit?

* Still some noise (MockRAMDir, DocFieldProcessorPerThread, some
changes in LogMergePolicy)

Realtime Search
---------------

Key: LUCENE-1313
URL: https://issues.apache.org/jira/browse/LUCENE-1313
Project: Lucene - Java
Issue Type: New Feature
Components: Index
Affects Versions: 2.4.1
Reporter: Jason Rutherglen
Priority: Minor
Fix For: 2.9

Attachments: LUCENE-1313.jar, LUCENE-1313.patch, LUCENE-1313.patch, LUCENE-1313.patch, LUCENE-1313.patch, LUCENE-1313.patch, LUCENE-1313.patch, LUCENE-1313.patch, lucene-1313.patch, lucene-1313.patch, lucene-1313.patch, lucene-1313.patch


Realtime search with transactional semantics.
Possible future directions:
* Optimistic concurrency
* Replication
Encoding each transaction into a set of bytes by writing to a RAMDirectory enables replication. It is difficult to replicate using other methods because while the document may easily be serialized, the analyzer cannot.
I think this issue can hold realtime benchmarks which include indexing and searching concurrently.
--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

Search Discussions

  • Jason Rutherglen (JIRA) at May 1, 2009 at 6:39 pm
    [ https://issues.apache.org/jira/browse/LUCENE-1313?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

    Jason Rutherglen updated LUCENE-1313:
    -------------------------------------

    Attachment: LUCENE-1313.patch

    * IndexFileDeleter takes into account the ram directory (which
    when using NRT with the FSD caused files to not be found).

    * FSD is included and writes fdx, fdt, tvx, tvf, tvd extension
    files to the primary directory (which is the same as
    IW.directory). LUCENE-1618 needs to be updated with these
    changes (or we simply include it in this patch as the
    LUCENE-1618 patch is only a couple of files).

    * Removed DocumentsWriter.ramOverLimit

    * I think we need to give the option of a ram mergescheduler
    because the user may not want the ram merging and disk
    merging to compete for threads. I'm thinking of the use case
    where NRT is a priority then one may allocate more threads to
    the ram CMS and less to the disk CMS. This also gives us the
    option of trying out more parameters when performing benchmarks
    of NRT.

    * We may want to default the ram mergepolicy to not use compound
    files as it's not useful when using a ram dir?

    * Because FSD uses IW.directory, FSD will list files that
    originated from FSD and from IW.directory, we may want to keep
    track of which files are supposed to be in FSD (from the
    underlying primary dir) and which are not?

    {quote}If NRT is never used, the behavior of IW should be
    unchanged (which is not the case w/ this patch I think). RAMDir
    should be created the first time a flush is done due to NRT
    creation. {quote}

    In the patch if ramdir is not passed in, the behavior of IW
    remains the same as it is today. You're saying we should have IW
    create the ramdir by default after getReader is called and
    remove the IW ramdir constructor? What if the user has an
    alternative ramdir implementation they want to use?

    {quote}StoredFieldsWriter & TermVectorsTermsWriter now write to
    IndexWriter.getFlushDirectory(), which is confusing because that
    method returns the RAMDir if set? Shouldn't this be the
    opposite? (Ie it should flush to IndexWriter.getDirectory()? Or
    we should change getFlushDirectory to NOT return the
    ramdir?){quote}

    The attached patch uses FileSwitchDirectory, where these files
    are written to the primary directory (IW.directory). So
    getFlushDirectory is ok?

    {quote}Why did you need to add synchronized to some of the
    SegmentInfo files methods? (What breaks if you undo that?). The
    contract here is IW protects access to SegmentInfo/s{quote}

    SegmentInfo.files was being cleared while sizeInBytes was called
    which resulted in an NPE. The alternative is sync IW in
    IW.size(SegmentInfos) which seems a bit extreme just to obtain
    the size of a segment info?

    {quote}The MergePolicy needs some smarts when it's dealing w/
    RAM. EG it should not do a merge of more than XXX% of total RAM
    usage (should flush to the real directory instead){quote}

    Isn't this handled well enough in updatePendingMerges or is
    there more that needs to be done?
  • Michael McCandless (JIRA) at May 2, 2009 at 10:51 am
    [ https://issues.apache.org/jira/browse/LUCENE-1313?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12705255#action_12705255 ]

    Michael McCandless commented on LUCENE-1313:
    --------------------------------------------


    {quote}
    IndexFileDeleter takes into account the ram directory (which
    when using NRT with the FSD caused files to not be found).
    {quote}

    I don't like how "deep" the dichotomy of "RAMDir vs FSDir" is being
    pushed. Why can't we push FSD down to all these places (IFD,
    SegmentInfo/s, etc.)?

    {quote}
    FSD is included and writes fdx, fdt, tvx, tvf, tvd extension
    files to the primary directory (which is the same as
    IW.directory). LUCENE-1618 needs to be updated with these
    changes (or we simply include it in this patch as the
    LUCENE-1618 patch is only a couple of files).
    {quote}

    Why did this require changes to FSD?

    {quote}
    I think we need to give the option of a ram mergescheduler
    because the user may not want the ram merging and disk
    merging to compete for threads. I'm thinking of the use case
    where NRT is a priority then one may allocate more threads to
    the ram CMS and less to the disk CMS. This also gives us the
    option of trying out more parameters when performing benchmarks
    of NRT.
    {quote}

    I think we're unlikely to gain from more than 1 BG thread for RAM
    merging? But I agree it'd be horrible if CMS blocked RAM merging
    because its allotted threads were tied up merging disk segments.
    Could we simply make the single CMS instance smart enough to realize
    that a single RAM merge is allowed to proceed regardless of the thread
    limit?
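
A minimal sketch of that idea, assuming a simple admission check in front of the merge threads. The class and method names (MergeAdmissionSketch, tryStart, finished) are hypothetical and are not taken from ConcurrentMergeScheduler or from the patch:

```java
// Hypothetical admission check; not ConcurrentMergeScheduler's real code.
class MergeAdmissionSketch {
    private final int maxDiskMerges;   // thread limit, applied to disk merges only
    private int runningDiskMerges;
    private boolean ramMergeRunning;

    MergeAdmissionSketch(int maxDiskMerges) {
        this.maxDiskMerges = maxDiskMerges;
    }

    // Returns true if the merge may start now.
    synchronized boolean tryStart(boolean isRamMerge) {
        if (isRamMerge) {
            // One RAM merge at a time, admitted even when disk merges
            // already occupy every allotted thread.
            if (ramMergeRunning) {
                return false;
            }
            ramMergeRunning = true;
            return true;
        }
        if (runningDiskMerges >= maxDiskMerges) {
            return false;
        }
        runningDiskMerges++;
        return true;
    }

    // Must be called once for every merge that tryStart admitted.
    synchronized void finished(boolean isRamMerge) {
        if (isRamMerge) {
            ramMergeRunning = false;
        } else {
            runningDiskMerges--;
        }
    }
}
```

With a limit of one disk merge, a second disk merge is refused while a RAM merge is still admitted, which is the behavior asked for above.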

    {quote}
    We may want to default the ram mergepolicy to not use compound
    files as it's not useful when using a ram dir?
    {quote}

    I think actually hardwire it, not just default. Building CFS in RAM
    makes no sense. Worse, if we allow one to choose to do it we then
    have to fix FSD to understand CFX must go to the dir too, and, we'd
    have to fix IW to not merge in the doc store files when building a
    private CFS. Net/net I think we should not allow CFS for the RAM
    segments.

    On merging to disk it can then respect the user's CFS setting.

    {quote}
    Because FSD uses IW.directory, FSD will list files that
    originated from FSD and from IW.directory, we may want to keep
    track of which files are supposed to be in FSD (from the
    underlying primary dir) and which are not?
    {quote}

    I don't understand what's wrong here?

    {quote}
    If NRT is never used, the behavior of IW should be
    unchanged (which is not the case w/ this patch I think). RAMDir
    should be created the first time a flush is done due to NRT
    creation.
    In the patch if ramdir is not passed in, the behavior of IW
    remains the same as it is today. You're saying we should have IW
    create the ramdir by default after getReader is called and
    remove the IW ramdir constructor?
    {quote}

    Right. This should be "under the hood".

    {quote}
    What if the user has an alternative ramdir implementation they want to
    use?
    {quote}

    I think I'd rather not open up that option just yet. This really is a
    private optimization to how IW uses RAM. We may want to further
    change/improve how RAM is used.

    Way back when, IW used a RAMDir internally for buffering; then, with
    LUCENE-843 we switched to a whole different format (DW's ram
    buffering). Now we are adding back RAMDir for NRT; maybe we'll switch
    its format at some point... or change NRT to directly search DW's
    RAM... etc. How IW uses RAM is very much an internal detail so I'd
    rather not expose it publicly.

    [BTW: once we have this machinery online, it's conceivable that we'd
    want to flush to RAMDir even in the non-NRT case. EG, say DW's RAM
    buffer is full and it's time to flush. If it flushes to RAM,
    typically the RAMDir is far more compact than DW's RAM buffer and it
    then still has some more space to work with, before having to flush to
    disk. If we explore this it should be in a new issue (later)...]

    {quote}
    StoredFieldsWriter & TermVectorsTermsWriter now write to
    IndexWriter.getFlushDirectory(), which is confusing because that
    method returns the RAMDir if set? Shouldn't this be the
    opposite? (Ie it should flush to IndexWriter.getDirectory()? Or
    we should change getFlushDirectory to NOT return the
    ramdir?)
    The attached patch uses FileSwitchDirectory, where these files
    are written to the primary directory (IW.directory). So
    getFlushDirectory is ok?
    {quote}

    OK, though I'd like to simply always use FSD, even if primary &
    secondary are the same dir. All these if's checking for both dirs,
    passing both dirs deep into Lucene's APIs, etc., are spooky.

    {quote}
    Why did you need to add synchronized to some of the
    SegmentInfo files methods? (What breaks if you undo that?). The
    contract here is IW protects access to SegmentInfo/s
    SegmentInfo.files was being cleared while sizeInBytes was called
    which resulted in an NPE. The alternative is sync IW in
    IW.size(SegmentInfos) which seems a bit extreme just to obtain
    the size of a segment info?
    {quote}

    But... why did we have one thread asking for size while another was
    tweaking the SegmentInfo? What leads to that? We need to better
    understand the root cause here.

    The size consumed by the RAM segments should be carefully computed
    (called only in synchronized(iw) context) and then shared. This value
    changes relatively rarely (on flushing a new segment to ram; on
    applying deletes that include RAM segments; on doing a ram->ram
    merge), but is read frequently (per doc added, to decide whether it's
    time to flush). I think the value should be pushed to DW whenever it
    changes, via synchronized method in DW; and then the existing
    synchronized logic in DW that decides if it's time to "flush after"
    should consult that value. No further synchronizing should be
    necessary.

    Also, this ram size should be used not only for deciding when it's
    time to merge to a disk segment, but also when it's time for DW to
    flush a new segment (which I think your current patch is missing?).
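
The "push on change" accounting described above could look roughly like the following. This is only a sketch under stated assumptions: RamAccountingSketch and its methods are hypothetical names standing in for the bookkeeping DW would do, and sizes are plain byte counts:

```java
// Hypothetical stand-in for DW's flush accounting; names are not from the patch.
class RamAccountingSketch {
    private final long maxBytes;     // overall RAM budget
    private long ramSegmentBytes;    // last value pushed down by the writer
    private long bufferBytes;        // bytes currently held in the RAM buffer

    RamAccountingSketch(long maxBytes) {
        this.maxBytes = maxBytes;
    }

    // Called rarely: after a flush to RAM, after deletes are applied to
    // RAM segments, or after a RAM -> RAM merge.
    synchronized void setRamSegmentBytes(long bytes) {
        ramSegmentBytes = bytes;
    }

    // Called per added document: only reads the cached value, so no
    // directory size needs to be recomputed here.
    synchronized boolean addDocument(long docBytes) {
        bufferBytes += docBytes;
        return bufferBytes + ramSegmentBytes >= maxBytes;
    }

    // Called once the buffer has been flushed as a new segment.
    synchronized void bufferFlushed() {
        bufferBytes = 0;
    }
}
```

The per-document path never touches the RAM segments themselves; it only compares against the number most recently pushed by the writer.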

    {quote}
    The MergePolicy needs some smarts when it's dealing w/ RAM. EG it
    should not do a merge of more than XXX% of total RAM usage (should
    flush to the real directory instead).
    Isn't this handled well enough in updatePendingMerges or is
    there more that needs to be done?
    {quote}

    There is more that needs to be done, because MergePolicy must
    conditionalize its logic based on RAM vs FS. Ie, if our RAM buffer is
    32 MB, and there are say 31 MB of RAM segments that suddenly need
    merging (because we just flushed the 10th RAM segment), we should not
    do a RAM -> RAM merge at that point (because 31./32. = very high pctg
    of our net RAM buffer). Instead we should force RAM -> disk at that
    point, even though technically RAM is not yet full.

    Ooh: maybe a better approach is to disallow the merge if the expected
    peak RAM usage will exceed our buffer. I like this better. So if
    budget is 32 MB, and net RAM used (segments + DW) is say 22, we have a
    10 MB "budget", so we are allowed to select merges that total to < 10
    MB.
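
As a rough illustration of the budget rule, here is a sketch that picks RAM segments to merge only while their combined size stays below the free RAM. The class and method names are hypothetical, and treating the sum of the input sizes as the merge's peak cost is an assumption made for the example:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical selection helper; not part of MergePolicy.
class RamMergeBudgetSketch {
    // Walks the segments in order and keeps those whose running total
    // stays strictly below the free RAM (max - used).
    static List<Long> selectForRamMerge(List<Long> segmentBytes,
                                        long maxRamBytes,
                                        long usedRamBytes) {
        long free = maxRamBytes - usedRamBytes;
        List<Long> picked = new ArrayList<>();
        long total = 0;
        for (long size : segmentBytes) {
            if (total + size >= free) {
                break;
            }
            picked.add(size);
            total += size;
        }
        // A merge needs at least two segments; otherwise select nothing
        // and leave it to the caller to flush to the real directory.
        return picked.size() >= 2 ? picked : new ArrayList<Long>();
    }
}
```

With a 32 MB budget and 22 MB in use, segments of 3, 4 and 5 MB yield a merge of the first two (7 MB), since adding the third would reach 12 MB and exceed the 10 MB that is free.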

  • Jason Rutherglen (JIRA) at May 4, 2009 at 6:31 pm
    [ https://issues.apache.org/jira/browse/LUCENE-1313?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12705675#action_12705675 ]

    Jason Rutherglen commented on LUCENE-1313:
    ------------------------------------------

    {quote}I don't like how "deep" the dichotomy of "RAMDir vs
    FSDir" {quote}

    Agreed, it's a bit awkward but I don't see another way to do
    this. The good thing is if IW has written some .fdt files to the
    main dir (via FSD), IW crashes, then IW is created again, IFD
    automatically deletes the extraneous .fdt (and other extension)
    files.

    {quote}Why can't we push FSD down to all these places (IFD,
    SegmentInfo/s, etc.)?{quote}

    {quote}Could we simply make the single CMS instance smart enough
    to realize that a single RAM merge is allowed to proceed
    regardless of the thread limit?{quote}

    Hmm... I think for benchmarking it would be good to allow
    options as we simply don't know. In the latest patch a ram
    mergescheduler can be set to the IndexWriter.

    {quote}have to fix FSD to understand CFX must go to the dir
    too{quote}

    I think this is fixed in the patch, where compound files are not
    created in RAM.

    {quote}You're saying we should have IW create the ramdir by default
    after getReader is called and remove the IW ramdir constructor?
    Right. This should be "under the hood".{quote}

    Ok, this will require some reworking of the patch.

    {quote}OK, though I'd like to simply always use FSD, even if
    primary & secondary are the same dir. {quote}

    How will always using FSD work? Doesn't it assume writing to two
    different directories?

    {quote}this ram size should be used not only for deciding when
    it's time to merge to a disk segment, but also when it's time
    for DW to flush a new segment{quote}

    In the new patch this is fixed.

    {quote}So if budget is 32 MB, and net RAM used (segments + DW)
    is say 22, we have a 10 MB "budget", so we are allowed to select
    merges that total to < 10 MB.{quote}

    One issue is the ram buffer flush doubles the ram used (because
    the segment is flushed as is to the RAM dir). You're saying
    roughly estimate the ram size used on the result of a merge and
    have the merge policy take this into account? This makes sense,
    otherwise we will consistently (if temporarily) exceed the ram
    buffer size. The algorithm is fairly simple? Find segments whose
    total sizes are lower than whatever we have left of the max ram
    buffer size? I have new code, but will rework it a bit to
    include this discussion.
  • Michael McCandless (JIRA) at May 4, 2009 at 10:16 pm
    [ https://issues.apache.org/jira/browse/LUCENE-1313?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12705788#action_12705788 ]

    Michael McCandless commented on LUCENE-1313:
    --------------------------------------------

    {quote}
    OK, though I'd like to simply always use FSD, even if
    primary & secondary are the same dir.
    How will always using FSD work? Doesn't it assume writing to two
    different directories?
    {quote}

    I think on creating IW the user should state (via new expert ctor)
    that they intend to use it for NRT (say, a new boolean
    "enableNearRealTime").

    Then we could pass IFD either an FSD (when in NRT mode) or the normal
    directory when not in NRT mode. IFD would no longer have to
    duplicate FSD's logic (summing the two dirs' listAlls, the
    getDirectoryForFile).

    SegmentInfos.hasExternalSegments, and MultiSegmentReader ctor, should
    be "smart" when they're passed an FSD (probably we should add
    Directory.contains(Directory) method, which by default returns true if
    this.equals(dir), but FSD would override to return true if the
    incoming dir .equals primary & secondary).
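
A sketch of what such a contains method might look like, using stand-in classes rather than Lucene's Directory and FileSwitchDirectory. DirSketch and SwitchDirSketch are hypothetical names, and reading ".equals primary & secondary" as "equals either one" is an assumption:

```java
// Stand-in for a directory; by default it contains only itself.
class DirSketch {
    boolean contains(DirSketch other) {
        return this.equals(other);
    }
}

// Stand-in for a switching directory that wraps two others.
class SwitchDirSketch extends DirSketch {
    private final DirSketch primary;
    private final DirSketch secondary;

    SwitchDirSketch(DirSketch primary, DirSketch secondary) {
        this.primary = primary;
        this.secondary = secondary;
    }

    @Override
    boolean contains(DirSketch other) {
        // Assumption: a match on either wrapped directory (or on this
        // wrapper itself) counts as "contained".
        return super.contains(other)
            || primary.contains(other)
            || secondary.contains(other);
    }
}
```

Callers that currently compare directories with equals could then ask the writer's single directory whether it contains a segment's directory, without knowing whether it is a wrapper.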

    Likewise all the switching in DW to handle two dirs should be rolled
    back (eg you added DW.fileLength(name, dir1, dir2) that's dup code with
    FSD).

    {quote}
    One issue is the ram buffer flush doubles the ram used (because
    the segment is flushed as is to the RAM dir).
    {quote}

    I think we must keep transient RAM usage below the specified limit, so
    that limits our flushing freedom. Ie, in the NRT case, once DW's RAM
    buffer exceeds half of the allowed remaining RAM budget (ie, the limit
    minus total RAM segments) then we trigger a flush to RAM and then to
    the "real" dir.

    Or... we could flush the new segment directly to the real dir as one
    segment, and merge all prior RAM segments as a separate new segment in
    the main dir, if the free RAM is large enough.
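
The first option (flush once the buffer exceeds half of the remaining budget) reduces to a one-line check. A minimal sketch with hypothetical names, taking all sizes in bytes:

```java
// Hypothetical helper; not DocumentsWriter's actual flush logic.
class NrtFlushTriggerSketch {
    // Flushing the buffer to RAM transiently holds the data twice, so the
    // buffer may grow to at most half of what the RAM segments leave free.
    static boolean shouldFlush(long limitBytes,
                               long ramSegmentBytes,
                               long bufferBytes) {
        long remaining = limitBytes - ramSegmentBytes;
        return bufferBytes * 2 > remaining;
    }
}
```

For example, with a 32 MB limit and 12 MB of RAM segments, 20 MB remain, so a 10 MB buffer is still within bounds and an 11 MB buffer triggers the flush.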

    {quote}
    this ram size should be used not only for deciding when
    it's time to merge to a disk segment, but also when it's time
    for DW to flush a new segment
    In the new patch this is fixed.
    {quote}

    I don't see where this is taken into account? Did you mean to attach
    a new patch?

  • Jason Rutherglen (JIRA) at May 5, 2009 at 12:32 am
    [ https://issues.apache.org/jira/browse/LUCENE-1313?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

    Jason Rutherglen updated LUCENE-1313:
    -------------------------------------

    Attachment: LUCENE-1313.patch

    * In DocumentsWriter.balanceRAM if NRT is on the total ram
    consumed is "(numBytesUsed * 2) + writer.getRamDirSize()".
    numBytesUsed is the current consumption of the ram buffer.
    Basically what we flush to ram, we'll consume that much of the
    buffer. This is now taken into account in the bufferIsFull
    calculation.

    * Double dir usage should be factored out.

    * TestIndexWriterRamDir.testFSDirectory fails. It tries to
    simulate a crashing IW. When the IW is created again it should
    delete the old files, but for some reason it does not with
    FSDirectory (open file handles on Windows perhaps).

    {quote} we could flush the new segment directly to the real dir
    as one segment, and merge all prior RAM segments as a separate
    new segment in the main dir, if the free RAM is large enough.
    {quote}

    Yeah it's unclear what the best policy is here. Do we want to
    have some sort of custom merge policy method/class to take care
    of this so the user can customize it?

  • Jason Rutherglen (JIRA) at May 5, 2009 at 12:53 am
    [ https://issues.apache.org/jira/browse/LUCENE-1313?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12705851#action_12705851 ]

    Jason Rutherglen commented on LUCENE-1313:
    ------------------------------------------

    Did we decide to simply add a boolean param in the ctor to turn
    on NRT instead of relying on getReader? Using getReader could
    cause problems with switching directories midstream.
  • Michael McCandless (JIRA) at May 5, 2009 at 8:33 am
    [ https://issues.apache.org/jira/browse/LUCENE-1313?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12705952#action_12705952 ]

    Michael McCandless commented on LUCENE-1313:
    --------------------------------------------

    {quote}
    Did we decide to simply add a boolean param in the ctor to turn
    on NRT instead of relying on getReader? Using getReader could
    cause problems with switching directories midstream.
    {quote}
    Yes, let's switch to that.
  • Michael McCandless (JIRA) at May 5, 2009 at 9:00 am
    [ https://issues.apache.org/jira/browse/LUCENE-1313?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12705963#action_12705963 ]

    Michael McCandless commented on LUCENE-1313:
    --------------------------------------------


    * We shouldn't add FSD.setPrimaryExtensions?

    * The dual-directory handling is continuing to push deeper (when
    I want to do the reverse). EG, MergeScheduler.getDestinationDirs
    should not be needed?

    * We should no longer need IndexWriter.getFlushDirectory? IE, IW
    once again has a single "Directory" as seen by IFD,
    DocFieldProcessorPerThread, etc. In the NRT case, this is an FSD;
    in the non-NRT case it's the Dir that was passed in (unless, in a
    future issue, we explore using FSD, too, for better performance).

    * We can't up and change CMS.handleMergeException (breaks
    back-compat); can you deprecate old & add a new one that calls old
    one? Let's have the new one take a Throwable and
    MergePolicy.OneMerge?

    * Instead of overriding "equals" (FSD.equals) can you change to
    "Directory.contains"?

    * IW's RAMDir usage still isn't factored in properly. EG
    DW.doBalanceRAM is not taking it into account.

    * Furthermore, we can't call writer.getRamDirSize() from
    DW.balanceRAM -- that's far too costly. Instead, whenever RAMDir
    changes (deletes are applied, or a new RAM segment is created), we
    must push down to DW that usage with a new synchronized method.
    (I described this above). We should remove
    IW.getRamDirSize()... ie, this size should always be "pushed on
    change", not "polled on read". We change it rarely and read it
    incredibly often.

    * We don't need IW.getRamLogMergePolicy()? Instead, let's ignore
    "MergePolicy.useCompoundFile()" when we are writing the new
    segment to RAMDir? Likewise we should not cast RAMMergePolicy to
    LogMergePolicy in setRAMMergePolicy, nor turn off its CFS there.

    * I still don't think we need a separate RAMMergeScheduler; I think
    CMS should simply always run such merges (ie not block on max
    thread count). IW.getNextMerge can then revert to its former
    self.

    * MergePolicy.OneMerge.segString no longer needs to take a
    Directory (because it now stores a Directory).

    * The mergeRAMSegmentsToDisk shouldn't be fully synchronized, eg
    when doWait is true it should release the lock while merges are
    taking place.

  • Jason Rutherglen (JIRA) at May 5, 2009 at 7:54 pm
    [ https://issues.apache.org/jira/browse/LUCENE-1313?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12706175#action_12706175 ]

    Jason Rutherglen commented on LUCENE-1313:
    ------------------------------------------

    {quote}RAMDir changes (deletes are applied, or a new RAM segment is
    created), we must push down to DW that usage with a new synchronized
    method.{quote}

    Sounds like we create a subclass of RAMDirectory with this
    functionality?

    {quote}We don't need IW.getRamLogMergePolicy()?{quote}

    Because we don't want the user customizing this?

    {quote}We should no longer need IndexWriter.getFlushDirectory? IE, IW
    once again has a single "Directory" as seen by IFD,
    DocFieldProcessorPerThread, etc. In the NRT case, this is an FSD; in
    the non-NRT case it's the Dir that was passed in (unless, in a future
    issue, we explore using FSD, too, for better performance).{quote}

    Pass in FSD in the constructor of DocumentsWriter (and others) as
    before?

    {quote}I still don't think we need a separate RAMMergeScheduler; I
    think CMS should simply always run such merges (ie not block on max
    thread count). IW.getNextMerge can then revert to its former
    self.{quote}

    Where does the thread come from for this if we're using max threads?
    If we allocate one, we're over limit and keeping it around. We'd need
    a more advanced threadpool that elastically grows the thread pool and
    kills threads that are unused over time. With Java 1.5 we can use
    ThreadPoolExecutor. Is a dedicated thread pool something we want to
    go to? Even then we can potentially still max out a given thread pool
    with requests to merge one directory or the other. We'd probably
    still need two separate thread pools.

    {quote}MergePolicy.OneMerge.segString no longer needs to take a
    Directory (because it now stores a Directory).{quote}

    Yeah, I noticed this, I'll change it. MergeSpecification.segString is
    public and takes a directory that is not required. What to do?
  • Jason Rutherglen (JIRA) at May 5, 2009 at 9:37 pm
    [ https://issues.apache.org/jira/browse/LUCENE-1313?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12706207#action_12706207 ]

    Jason Rutherglen commented on LUCENE-1313:
    ------------------------------------------

    {quote}The dual directories is continuing to push deeper (when I'm
    wanting to do the reverse). EG, MergeScheduler.getDestinationDirs
    should not be needed?{quote}

    If we remove getFlushDirectory, are you saying getDirectory should
    return the FSD if RAM NRT is turned on? This seems counter intuitive
    in that we still need a clear separation of the two directories? The
    user would expect the directory they passed into the ctor to be
    returned?

    getDestinationDirs is used by the ram merge scheduler, which if we
    use a single CMS would go away.

    I'm looking at how to get RAMLogMergePolicy to take into account the
    size of the ram segments it's merging such that they do not total
    beyond the remaining available ram. Looks like we could keep a
    running byte total while it's building the merges and stop once we've
    reached the limit, though I'm not sure how exact this is (will the
    merges be balanced using this system?). It seems like a variation on
    the LogByteSizeMergePolicy however it's unclear whether
    LogDocMergePolicy or LogByteSizeMergePolicy ram merges will perform
    better (does it matter since it's all in ram and we're capping the
    total?)
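
The running-byte-total idea above could look roughly like the following. This is only a sketch under stated assumptions: `RamMergeSelector` and `selectForRamMerge` are hypothetical names, segments are represented by their sizes in bytes, and the merged segment is assumed to need about as much RAM as the sum of its inputs. It does not address whether the resulting merges are balanced.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of any Lucene API.
class RamMergeSelector {

    // Walks the RAM segments in order, keeping a running byte total,
    // and stops at the first segment that would exceed the remaining RAM.
    static List<Long> selectForRamMerge(List<Long> segmentSizes, long availableBytes) {
        List<Long> chosen = new ArrayList<>();
        long total = 0;
        for (long size : segmentSizes) {
            if (total + size > availableBytes) {
                break; // the rest would have to be merged or flushed to disk
            }
            total += size;
            chosen.add(size);
        }
        return chosen;
    }
}
```

Segments left out of the returned list are the ones that would still need to go to the real directory.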
  • Jason Rutherglen (JIRA) at May 6, 2009 at 5:22 am
    [ https://issues.apache.org/jira/browse/LUCENE-1313?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12706305#action_12706305 ]

    Jason Rutherglen commented on LUCENE-1313:
    ------------------------------------------

    I'm not sure we have the right model yet for deciding when to
    flush the ram buffer and/or ram segments. Perhaps we can simply
    divide the ram buffer size in half, allocating one part to the
    ram buffer, the other to the ram segments. When one exceeds its
    (rambuffersize/2) allotment, it's flushed to disk. This way if
    the ram buffer size is 32MB, we will always safely flush 16MB to
    disk. The more ram allotted, the greater the size of what's flushed
    to disk. We may eventually want to offer an expert method to set
    the ram buffer size and ram dir max size individually.
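
A minimal sketch of the half-and-half split described above, assuming a hypothetical `HalfSplitBudget` class (none of these names exist in the patch). Each side gets `ramBufferBytes / 2`, and whichever side exceeds its allotment is the one flushed to disk.

```java
// Hypothetical illustration of the "divide the ram buffer size in half" model.
class HalfSplitBudget {

    enum Action { NONE, FLUSH_BUFFER_TO_DISK, FLUSH_RAM_SEGMENTS_TO_DISK }

    private final long half;

    HalfSplitBudget(long ramBufferBytes) {
        this.half = ramBufferBytes / 2;
    }

    // Bytes each side (ram buffer, ram segments) may use before flushing.
    long allotment() {
        return half;
    }

    // Decides what to flush given the current usage of both sides.
    Action check(long bufferBytes, long ramSegmentBytes) {
        if (bufferBytes > half) {
            return Action.FLUSH_BUFFER_TO_DISK;
        }
        if (ramSegmentBytes > half) {
            return Action.FLUSH_RAM_SEGMENTS_TO_DISK;
        }
        return Action.NONE;
    }
}
```

With a 32MB buffer, `allotment()` is 16MB, which matches the figure in the comment above.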

    Put another way I think we need a balanced upper limit for the
    ram buffer and the NRT ram dir, which seems (to me) to be hard
    to achieve by allowing too much growth at the expense of the
    other.

    I'd like to stay away from flushing the ram buffer to disk when
    it's below say 20% of the ram buffer size as it seems
    inefficient to do this (because we'll have to do an expensive
    disk merge on it later). On the other hand if the user is not
    calling get reader very often and we're auto flushing at 1/2 the
    ram buffer size, we're shortchanging ourselves and only
    flushing a segment half the size of what it could be. I suppose
    we could stick with the 1/2 model, only turning it on once ram
    segments are being merged in ram?

    If when merging ram segments (using the specialized
    RAMMergePolicy) we only merge in ram the ones that fit, what do
    we do with the ram segments remaining that need to be flushed to
    disk? What if they only make up 20% of the total size of the
    ram segments? If we merge the 20% to disk it seems inefficient?
  • Michael McCandless (JIRA) at May 6, 2009 at 11:17 am
    [ https://issues.apache.org/jira/browse/LUCENE-1313?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12706377#action_12706377 ]

    Michael McCandless commented on LUCENE-1313:
    --------------------------------------------


    {quote}
    RAMDir changes (deletes are applied, or a new RAM segment is
    created), we must push down to DW that usage with a new synchronized
    method.
    Sounds like we create a subclass of RAMDirectory with this
    functionality?
    {quote}

    I don't think that's needed. I think whenever IW makes a change to
    the RAMDir, which is easily tracked, it pushes to DW the new RAMDir
    size.

    {quote}
    We don't need IW.getRamLogMergePolicy()?
    Because we don't want the user customizing this?
    {quote}
    That, and because it's only used to determine CFS or not, which we've
    turned off for RAMDir.

    {quote}
    We should no longer need IndexWriter.getFlushDirectory? IE, IW
    once again has a single "Directory" as seen by IFD,
    DocFieldProcessorPerThread, etc. In the NRT case, this is an FSD; in
    the non-NRT case it's the Dir that was passed in (unless, in a future
    issue, we explore using FSD, too, for better performance).
    Pass in FSD in the constructor of DocumentsWriter (and others) as
    before?
    {quote}

    Right. All these places couldn't care less if they are dealing w/ FSD or
    a "real" dir. They should simply use the Directory API as they
    previously did.

    {quote}
    I still don't think we need a separate RAMMergeScheduler; I
    think CMS should simply always run such merges (ie not block on max
    thread count). IW.getNextMerge can then revert to its former
    self.
    Where does the thread come from for this if we're using max threads?
    If we allocate one, we're over limit and keeping it around. We'd need
    a more advanced threadpool that elastically grows the thread pool and
    kills threads that are unused over time. With Java 1.5 we can use
    ThreadPoolExecutor. Is a dedicated thread pool something we want to
    go to? Even then we can potentially still max out a given thread pool
    with requests to merge one directory or the other. We'd probably
    still need two separate thread pools.
    {quote}

    The thread is simply launched w/o checking maxThreadCount, if the
    merge is in RAM.

    Right, with JDK 1.5 we can make CMS better about pooling threads.
    Right now it does no long-term pooling (unless another merge happens
    to be needed when a thread finishes its last merge).
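
To make the "launch without checking maxThreadCount if the merge is in RAM" rule concrete, here is a small sketch. `SketchMergeScheduler` and its methods are hypothetical and are not ConcurrentMergeScheduler's actual code; in particular, this sketch returns `false` where a real scheduler would block until a thread frees up.

```java
// Hypothetical sketch of the admission rule only; no threads are started here.
class SketchMergeScheduler {

    private final int maxThreadCount;
    private int runningDiskMerges;

    SketchMergeScheduler(int maxThreadCount) {
        this.maxThreadCount = maxThreadCount;
    }

    // Returns true if a merge thread may be launched right now.
    synchronized boolean tryStart(boolean mergeIsInRam) {
        if (mergeIsInRam) {
            return true; // RAM merges skip the thread-count check
        }
        if (runningDiskMerges >= maxThreadCount) {
            return false; // a real scheduler would wait here instead
        }
        runningDiskMerges++;
        return true;
    }

    // Called when a disk merge thread finishes its work.
    synchronized void diskMergeFinished() {
        runningDiskMerges--;
    }
}
```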

    {quote}
    MergePolicy.OneMerge.segString no longer needs to take a
    Directory (because it now stores a Directory).
    Yeah, I noticed this, I'll change it. MergeSpecification.segString is
    public and takes a directory that is not required. What to do?
    {quote}
    Do the usual back-compat dance -- deprecate it and add the new one.
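
For readers unfamiliar with the pattern, the back-compat dance usually looks something like this. `SketchMergeSpec` is a hypothetical stand-in, and the `Object dir` parameter stands in for the unneeded Directory argument; the real signatures may differ.

```java
// Hypothetical stand-in for a class with a public method whose
// directory argument is no longer required.
class SketchMergeSpec {

    private final String segments;

    SketchMergeSpec(String segments) {
        this.segments = segments;
    }

    /**
     * Old signature, kept so existing callers still compile.
     *
     * @deprecated the directory argument is ignored; use {@link #segString()}.
     */
    @Deprecated
    String segString(Object dir) {
        return segString();
    }

    // New signature without the unneeded argument.
    String segString() {
        return segments;
    }
}
```

The deprecated overload can then be dropped in a later major release.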

    {quote}
    The dual directories is continuing to push deeper (when I'm
    wanting to do the reverse). EG, MergeScheduler.getDestinationDirs
    should not be needed?
    If we remove getFlushDirectory, are you saying getDirectory should
    return the FSD if RAM NRT is turned on? This seems counter intuitive
    in that we still need a clear separation of the two directories? The
    user would expect the directory they passed into the ctor to be
    returned?
    {quote}

    I agree, we should leave getDirectory() as is (returns whatever Dir
    was passed in).

    We can keep getFlushDirectory, but it should not have duality inside it
    -- it should simply return the FSD (in the NRT case) or the normal
    dir. I don't really like the name getFlushDirectory... but can't
    think of a better one yet.

    Then, nothing outside of IW should ever know there are two directories
    at play. They all simply deal with the one and only Directory that IW
    hands out.

    On the "when to flush to RAM" question... I agree it's tricky. This
    logic belongs in the RAMMergePolicy. That policy needs to be
    empowered to decide if a new flush goes to RAM or disk, to decide when
    to merge all RAM segments to a new disk segment, to be able to check
    if IW is in NRT mode, etc. Probably the RAM merge policy also needs
    control over how much of the RAM buffer it's going to give to DW,
    too. At first the policy should not change the non-NRT case (ie one
    always flushes straight to disk). We can play w/ that in a separate
    issue. Need to think more about the logic...

  • Jason Rutherglen (JIRA) at May 6, 2009 at 7:14 pm
    [ https://issues.apache.org/jira/browse/LUCENE-1313?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12706557#action_12706557 ]

    Jason Rutherglen commented on LUCENE-1313:
    ------------------------------------------

    {quote}I don't think that's needed. I think whenever IW makes a
    change to the RAMDir, which is easily tracked, it pushes to DW
    the new RAMDir size.{quote}

    Because we know the IW.ramdir is a RAMDirectory implementation,
    we can use sizeInBytes? It's synchronized, maybe we want a
    different method that's not? It seems like keeping track of all
    file writes outside ramdir is going to be difficult? For
    example when we do deletes via SegmentReader how would we keep
    track of that?

    {quote}That, and because it's only used to determine CFS or not,
    which we've turned off for RAMDir.{quote}

    So we let the user set the RAMMergePolicy but not get it?

    {quote}The thread is simply launched w/o checking
    maxThreadCount, if the merge is in RAM.{quote}

    Hmm... We can't just create threads and let them be garbage
    collected as JVMs tend to throw OOMs with this. If we go down
    this route of a single CMS, maybe we can borrow some code from
    an Apache project that's implemented a threadpool.


  • Michael McCandless (JIRA) at May 6, 2009 at 9:53 pm
    [ https://issues.apache.org/jira/browse/LUCENE-1313?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12706611#action_12706611 ]

    Michael McCandless commented on LUCENE-1313:
    --------------------------------------------

    {quote}
    I don't think that's needed. I think whenever IW makes a
    change to the RAMDir, which is easily tracked, it pushes to DW
    the new RAMDir size.
    Because we know the IW.ramdir is a RAMDirectory implementation,
    we can use sizeInBytes? It's synchronized, maybe we want a
    different method that's not? It seems like keeping track of all
    files writes outside ramdir is going to be difficult? For
    example when we do deletes via SegmentReader how would we keep
    track of that?
    {quote}

    We should definitely just use the sizeInBytes() method.

    I'm saying that IW knows when it writes new files to the RAMDir
    (flushing deletes, flushing new segment) and it's only at those times
    that it should call sizeInBytes() and push that value down to DW.
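
A rough sketch of "pushed on change, not polled on read", under the assumption that the writer is the only thing adding files to the RAM directory. All class and method names here (`SketchRamDir`, `SketchDocWriter`, `SketchWriter`, `setRamDirBytes`, `overBudget`) are hypothetical stand-ins, not the patch's API.

```java
// Hypothetical stand-in for the directory holding RAM segments.
class SketchRamDir {
    private long bytes;

    synchronized void addFile(long sizeInBytes) { bytes += sizeInBytes; }

    synchronized long sizeInBytes() { return bytes; }
}

// Hypothetical stand-in for DocumentsWriter's RAM accounting.
class SketchDocWriter {
    private final long ramBufferBytes;
    private long bufferedBytes; // bytes used by the in-memory indexing buffer
    private long ramDirBytes;   // last value pushed down by the writer

    SketchDocWriter(long ramBufferBytes) { this.ramBufferBytes = ramBufferBytes; }

    // Called rarely: only when the RAM directory actually changes.
    synchronized void setRamDirBytes(long bytes) { ramDirBytes = bytes; }

    synchronized void addBuffered(long bytes) { bufferedBytes += bytes; }

    // Called very often: reads the cached value, never touches the directory.
    synchronized boolean overBudget() {
        return bufferedBytes + ramDirBytes >= ramBufferBytes;
    }
}

// Hypothetical stand-in for IndexWriter.
class SketchWriter {
    final SketchRamDir ramDir = new SketchRamDir();
    final SketchDocWriter docWriter;

    SketchWriter(long ramBufferBytes) {
        docWriter = new SketchDocWriter(ramBufferBytes);
    }

    // The writer knows when it writes to the RAM directory, so it pushes
    // the new size at exactly that point.
    void writeFileToRam(long fileBytes) {
        ramDir.addFile(fileBytes);
        docWriter.setRamDirBytes(ramDir.sizeInBytes());
    }
}
```

The frequent budget check then costs one addition and a comparison, rather than a call into the directory.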

    {quote}
    That, and because it's only used to determine CFS or not,
    which we've turned off for RAMDir.
    So we let the user set the RAMMergePolicy but not get it?
    {quote}

    Oh, we should add a getter (getRAMMergePolicy, not getLogMergePolicy)
    for it, but it should return MergePolicy not LogMergePolicy.

    {quote}
    The thread is simply launched w/o checking
    maxThreadCount, if the merge is in RAM.
    Hmm... We can't just create threads and let them be garbage
    collected as JVMs tend to throw OOMs with this. If we go down
    this route of a single CMS, maybe we can borrow some code from
    an Apache project that's implemented a threadpool.
    {quote}

    This is how CMS has always been. It launches threads relatively
    rarely -- this shouldn't lead to OOMs. One can always subclass CMS if
    this is somehow a problem. Or we could modify CMS to pool its threads
    (as a new issue)?

  • Jason Rutherglen (JIRA) at May 6, 2009 at 11:38 pm
    [ https://issues.apache.org/jira/browse/LUCENE-1313?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12706662#action_12706662 ]

    Jason Rutherglen commented on LUCENE-1313:
    ------------------------------------------

    In the patch the merge policies are split up which requires some
    of the RAM NRT logic to be in updatePendingMerges.

    One solution is to have a merge policy that manages merging to
    ram and to disk, kind of an overarching merge policy for the
    primary MP and the RAM MP. This would push the logic of ram
    merging and primary dir merging to the meta merge policy which
    would clean up IW from managing ram segs vs. prim segs.

    Do IW.optimize and IW.expungeDeletes operate on the ramdir as
    well (the expungeDeletes javadoc implies calling
    IR.numDeletedDocs will return zero when there are no deletes)?
  • Jason Rutherglen (JIRA) at May 8, 2009 at 5:39 am
    [ https://issues.apache.org/jira/browse/LUCENE-1313?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12707243#action_12707243 ]

    Jason Rutherglen commented on LUCENE-1313:
    ------------------------------------------

    Something in the DocumentsWriter API we may need to change is to
    allow passing a directory through the IndexingChain. In the RAM
    NRT case, which directory we write to can change depending on if
    a ram buffer has exceeded its maximum available size. If it is
    under half the available ram it will go to the ram dir, if not
    the new segment will be written to disk. For this reason we
    can't simply pass a directory into the constructor of
    DocumentsWriter, nor can we rely on calling
    IW.getFlushDirectory. We should be able to rely on the directory
    in SegmentWriteState?
  • Michael McCandless (JIRA) at May 8, 2009 at 9:08 am
    [ https://issues.apache.org/jira/browse/LUCENE-1313?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12707283#action_12707283 ]

    Michael McCandless commented on LUCENE-1313:
    --------------------------------------------

    {quote}
    Does IW.optimize and IW.expungeDeletes operate on the ramdir as
    well (the expungeDeletes javadoc implies calling
    IR.numDeletedDocs will return zero when there are no deletes).
    {quote}
    I think IW.optimize should mean all RAM segments are merged into the single on-disk segment?

    And IW.expungeDeletes should also apply to RAM segments, ie if RAM segments have pending deletes, they are merged away (possibly entirely in RAM, ie the RAM merge policy could simply merge to a new RAM segment)?

    {quote}
    In the patch the merge policies are split up which requires some
    of the RAM NRT logic to be in updatePendingMerges.
    One solution is to have a merge policy that manages merging to
    ram and to disk,
    {quote}

    It looks like it's the "is it time to flush to disk" logic, right? Why can't we make that the purview of the RAM MergePolicy? We may need to extend MergePolicy API to tell it how much RAM is free in the budget.

    bq. We should be able to rely on the directory in SegmentWriteState?

    I think we should fix the indexing chain to always use SegmentWriteState's Directory and *not* pass Directory to the ctors? Does something go wrong if we take that approach?
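
As an illustration of that approach, a consumer in the indexing chain would hold no Directory of its own and resolve the target from the per-flush state. The types below (`SketchDirectory`, `SketchWriteState`, `SketchFieldsWriter`) are simplified, hypothetical stand-ins rather than the real Directory, SegmentWriteState or StoredFieldsWriter, and the `.fdt` suffix is only used as an example file name.

```java
// Hypothetical, simplified stand-in for a Directory.
class SketchDirectory {
    final String name;

    SketchDirectory(String name) { this.name = name; }
}

// Hypothetical, simplified stand-in for the per-flush write state.
class SketchWriteState {
    final SketchDirectory directory;
    final String segmentName;

    SketchWriteState(SketchDirectory directory, String segmentName) {
        this.directory = directory;
        this.segmentName = segmentName;
    }
}

// A chain consumer with no Directory passed to its constructor.
class SketchFieldsWriter {

    // The target directory comes from the state handed in at flush time,
    // so one flush can go to RAM and the next one to disk.
    String flush(SketchWriteState state) {
        return state.directory.name + "/" + state.segmentName + ".fdt";
    }
}
```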
  • Jason Rutherglen (JIRA) at May 11, 2009 at 4:38 pm
    [ https://issues.apache.org/jira/browse/LUCENE-1313?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12708100#action_12708100 ]

    Jason Rutherglen commented on LUCENE-1313:
    ------------------------------------------

    {quote}IW.optimize should mean all RAM segments are merged into
    the single on-disk segment{quote}

    Yes.

    {quote}IW.expungeDeletes should also apply to RAM segments, ie
    if RAM segments have pending deletes, they are merged away
    (possibly entirely in RAM, ie the RAM merge policy could simply
    merge to a new RAM segment)?{quote}

    Yes.

    {quote}we should fix the indexing chain to always use
    SegmentWriteState's Directory and not pass Directory to the
    ctors{quote}

    Yep.

    The next patch will have these features.
  • Jason Rutherglen (JIRA) at May 12, 2009 at 3:23 am
    [ https://issues.apache.org/jira/browse/LUCENE-1313?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

    Jason Rutherglen updated LUCENE-1313:
    -------------------------------------

    Attachment: LUCENE-1313.patch

    * A single merge scheduler is used. We will need to open a new
    issue for a version of ConcurrentMergeScheduler that allocates
    threads perhaps based on the merge.directory? We'd also probably
    want to add thread pooling.

    * There's a package-protected IW ctor that accepts the ram dir.
    This is used in the test case for ensuring we aren't creating
    .cfs files in the ram dir.

    * IW.optimize merges all segments (ram included) to the primary
    dir

    * IW.expungeDeletes merges segments with deletes, in ram ones
    stay in ram (unless they won't fit), and primary dir ones are
    handled as usual

    * Added testOptimize, testExpungeDeletes, and some other test
    cases

    * Needs a test case to make sure we're merging to the primary
    dir when the ram dir is full or a flush won't fit in the ram dir

    * There's a mergeRamSegmentsToDir and resolveRamSegments. Two
    different methods because mergeRamSegmentsToDir operates by
    simply scheduling merges, resolveRamSegments operates in the
    foreground like resolveExternalSegments. I'm not sure if we can
    combine the two. resolveRamSegments seems to have a thread
    notification problem and so hangs at times. I'll look into this
    further unless it's obvious what the problem is.

    * When RAM NRT is on (via the IndexWriter constructor), setting
    the ram buffer size allocates half of the given number to the
    DocumentsWriter buffer and half to the ram dir. It may be best
    to dynamically change these numbers based on usage etc.

    * Added NRTMergePolicy which is used only when RAM NRT is on. It
    utilizes the regular merge policy and the ram merge policy.

    * The ram dir size is pushed to DocumentsWriter

    * RAMMergePolicy extends LogDocMergePolicy and defaults the
    useCompoundFile and useCompoundDocStore to false

    * Sorry for the whitespace stuff, I'll clean it up later, I
    wanted to post the latest to get feedback

  • Jason Rutherglen (JIRA) at May 12, 2009 at 7:25 pm
    [ https://issues.apache.org/jira/browse/LUCENE-1313?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12708578#action_12708578 ]

    Jason Rutherglen commented on LUCENE-1313:
    ------------------------------------------

    I think the easiest way to handle the ram buf size vs. the ram
    dir size is to allow each to grow on request. I have some code
    I need to test that implements it. This way we're growing based
    on demand and availability. The only thing we may want to add is
    a way to grow and perhaps automatically flush based on the
    growth requested and perhaps prioritizing requests?
  • Jason Rutherglen (JIRA) at May 19, 2009 at 10:00 pm
    [ https://issues.apache.org/jira/browse/LUCENE-1313?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

    Jason Rutherglen updated LUCENE-1313:
    -------------------------------------

    Attachment: LUCENE-1313.patch

    * All tests pass, added more tests

    * Added DocumentsWriter.growRamBufferBy/growRamDirMaxBy methods
    that allow dynamically requesting more ram. We start off at
    50/50, ramdir/rambuffer. Then whenever one needs more, grow* is
    called.

    * We need a RAMPolicy class that allows customizing how ram is
    allocated. Currently the ramdir and the rambuffer compete for
    space, the user will presumably want to customize this.

    * I'm not sure the flushing always occurs when it should, and
    I'm not sure yet how to test that it's flushing when it should
    (other than watching a log). What happened to the patch that
    adds logging to Lucene?
  • Jason Rutherglen (JIRA) at May 20, 2009 at 4:32 am
    [ https://issues.apache.org/jira/browse/LUCENE-1313?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

    Jason Rutherglen updated LUCENE-1313:
    -------------------------------------

    Description:
    Enable near realtime search in Lucene without external
    dependencies. When RAM NRT is enabled, the implementation adds a
    RAMDirectory to IndexWriter. Flushes go to the ramdir unless
    there is no available space. Merges are completed in the ram
    dir until there is no more available ram.

    IW.optimize and IW.commit flush the ramdir to the primary
    directory, all other operations try to keep segments in ram
    until there is no more space.

    was:
    Realtime search with transactional semantics.

    Possible future directions:
    * Optimistic concurrency
    * Replication

    Encoding each transaction into a set of bytes by writing to a RAMDirectory enables replication. It is difficult to replicate using other methods because while the document may easily be serialized, the analyzer cannot.

    I think this issue can hold realtime benchmarks which include indexing and searching concurrently.

  • Michael McCandless (JIRA) at May 22, 2009 at 1:48 pm
    [ https://issues.apache.org/jira/browse/LUCENE-1313?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12712082#action_12712082 ]

    Michael McCandless commented on LUCENE-1313:
    --------------------------------------------

    I think generally we are close. I have lots of little comments from
    looking through the patch:

    * Can you update the CHANGES entry to something like "IndexWriter
    now uses RAM more efficiently when in near real-time mode"? (Ie
    we don't pass RAMDir to IW).

    * DW.push/getRAMDirSize, RAMTotalMax, RAMBufferAvailable, etc. need
    to be synchronized?

    * Since IW.flushDocStores always goes to the main directory, why
    does it now take a Directory arg?

    * I don't think doAfterFlush should be responsible for calling
    pushRamDirSize(); that's more of a hook for external subclasses.

    * Yes, IW.ramSizeInBytes() should include the ramDir's bytes

    * There are still places where Directory.contains should be used,
    instead of pulling both dirs and checking each. EG, the assert in
    DW.applyDeletes, and this assert in IW:
    {code}
    if (ramNrt && merge.directory == switchDirectory) {
      assert !merge.useCompoundFile;
    }
    {code}
    I'd like to eliminate IW.getInternalDirectory, if possible: to
    anyone interacting with IW, there is only one Directory, and the
    switching is entirely "under the hood".

    * I realized there is in fact a benefit to using CFS in RAM: much
    better RAM efficiency for tiny segments (because RAMDir's buffer
    size is 1 KB). Though such segments would presumably be merged
    away with time, so it may not be a big deal...

    * Is IW.mergeRAMSegmentToDir only for testing?

    * Can you name things theRAMSetting instead of theRamSetting? (Ie,
    RAM is all caps).

    * For IW.resolveRAMSegments, maybe we should make a single merge
    that merges everything down? Why even bother interacting with a
    merge policy, here?

    * Can you rename flush()'s new arg "flushToRAM" to
    "allowFlushToRAM"? Ie, even when this is true, that method may
    decide RAM is full and in fact flush to the real dir.

    * Can you rename IW.ramNRT to IW.flushToRAM? (Since it's in fact
    orthogonal to NRT).

    * It's sneaky to set docWriter.flushToDir before calling
    docWriter.flush; can't we make that an arg to docWriter.flush?
    (And docWriter would never store it).

    * Why did you need to add DW.fileLength?

    * IW.SWITCH_FILE_EXTS should be private static final (not public)?

    * We lost private on a number of attrs in IW -- can you restore?
    (You should insert nocommit comments when you do that, to reduce
    risk that such changes slip in).

    * Likewise for SegmentReader.coreRef.

    * Why did you need to make RAMDir.sizeInBytes volatile? Isn't it
    always updated/accessed from sync(RAMDir) context?

    * Why do we need a new class RAMMergePolicy? (There's no API
    difference over MergePolicy). Can't we simply by default
    instantiate LogByteSizeMergePolicy, and set CFS/CFX to false?

    * IW.fileSwitchDirectory should be private?

    * Have you done any perf tests with flushToRAM = true? EG should we
    enable it by default? I think if we have a good policy for
    managing RAM it could very well be higher performance. But, we
    should explore this under a different issue, so leave the default
    at "no ram dir".

    On the "how to share RAM" between RAMDir & DW's RAM buffer... instead
    of pre-dividing and growing over time, I think we can simplify it by
    logically sharing a single "pool".

    The RAMDir only alters its ram usage when 1) we flush a new segment to
    it, 2) a merge completes (either writing to the real dir or to the ram
    dir), or 3) deletes are applied to segments in RAM. When such a
    change happens we notify DW. DW then adds that amount to its own
    ram consumption to decide when it's time to flush.

    For starters, and we can optimize this later, I don't think DW should
    choose on its own to flush itself to the RAMDir? That should only
    happen when getReader is called, and there's still plenty of RAM
    free.

    So what happens is... each time getReader() is called, we make a new
    smallish RAM segment. Over time, these RAM segments need merging so
    we merge them. (If such a merge is fairly large, probably instead of
    writing to ram it should write the new segment to the real dir, since
    intermediate RAM usage will be too high).

    At some point, DW detects that the RAMDir size plus its own buffer is
    at the limit. If DW's buffer is relatively small, it should probably
    simply flush to the RAMDir then dump entire RAMDir to the real dir as
    a single merge. If DW's buffer is big, as would happen if you opened
    an NRT reader but never actually called getReader(), it should flush
    straight to the real dir.

    One challenge we face is ensuring that while we are flushing all ram
    segments to disk, we don't block the getReader() turnaround. IE we
    can't make getReader() do that flush synchronously. So that needs to
    be a BG merge, but we must somehow temporarily disregard the size of
    those segments while the merge is running. Or, perhaps we "merge RAM
    segments to disk" a bit early, eg once RAM consumed is > 90% of the
    total RAM buffer, or something.
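
    The shared-pool decision described above could be sketched roughly as
    follows. This is only an illustration of the reasoning, not code from
    the patch: the class, the enum, and the "relatively small" cutoff (a
    quarter of the budget) are all assumptions made up for the example.

```java
// Hypothetical sketch of the shared RAM pool decision; none of these names are Lucene APIs.
final class SharedRamPoolSketch {
    enum Action { KEEP_BUFFERING, FLUSH_TO_RAM_THEN_MERGE_TO_DISK, FLUSH_TO_DISK }

    /**
     * Decides what DW should do once it has been told the current RAMDir size.
     * Assumption: a buffer below 1/4 of the budget counts as "relatively small".
     */
    static Action onRamChange(long ramDirBytes, long bufferBytes, long maxBytes) {
        if (ramDirBytes + bufferBytes < maxBytes) {
            return Action.KEEP_BUFFERING;                  // pool not yet at the limit
        }
        if (bufferBytes * 4 < maxBytes) {
            return Action.FLUSH_TO_RAM_THEN_MERGE_TO_DISK; // small buffer: flush to RAMDir, then dump RAMDir
        }
        return Action.FLUSH_TO_DISK;                       // big buffer: go straight to the real dir
    }

    /** Early trigger: start the background "RAM segments to disk" merge once usage exceeds 90%. */
    static boolean shouldStartEarlyMerge(long ramDirBytes, long bufferBytes, long maxBytes) {
        return (ramDirBytes + bufferBytes) * 10 > maxBytes * 9;
    }

    public static void main(String[] args) {
        System.out.println(onRamChange(90, 10, 100));
        System.out.println(shouldStartEarlyMerge(80, 11, 100));
    }
}
```

    Integer arithmetic is used for the thresholds so the 90% check does
    not depend on floating-point rounding.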

  • Jason Rutherglen (JIRA) at May 26, 2009 at 4:43 pm
    [ https://issues.apache.org/jira/browse/LUCENE-1313?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12713097#action_12713097 ]

    Jason Rutherglen commented on LUCENE-1313:
    ------------------------------------------

    I started on the pooled ram model after the last patch because
    it is cleaner. Bytes are allocated up to the given limit set by
    IW.setRAMBufferSizeMB. As mentioned below, we may want to add a
    setting for the max ram temporarily used.

    I'm reusing the DocumentsWriter.numBytesAlloc/numBytesUsed and
    created a RAMPolicy that manages ramDirBytesAlloc and
    ramDirBytes. Each time a merge is scheduled, the
    sizeof(segments) is allocated by RAMPolicy and the segmentsAlloc
    is stored in OneMerge. Once the merge completes or fails, the
    ramDirBytesAlloc is adjusted by the difference between the
    actual bytes used and OM.ramDirAlloc. This way we always have
    the most accurate ramDir allocation in RamP, and we properly
    adjust the amount of ram consumed. This works well with our
    concurrent merging model where we can't predict when a merge
    will complete.
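
    A minimal sketch of that reserve-then-adjust accounting, assuming a
    single byte counter guarded by a lock. The class and method names
    here are hypothetical and are not the ones in the patch; the sketch
    also ignores the bytes freed when the merged source segments are
    dropped.

```java
// Hypothetical sketch of reserve-then-adjust accounting for merges into the RAM dir.
final class RamDirAccountingSketch {
    private final long maxBytes;
    private long ramDirBytesAlloc; // bytes reserved or in use by the RAM dir

    RamDirAccountingSketch(long maxBytes) {
        this.maxBytes = maxBytes;
    }

    /** Reserves the estimated merged size up front; false means the merge should target the real dir. */
    synchronized boolean reserveForMerge(long estimatedBytes) {
        if (ramDirBytesAlloc + estimatedBytes > maxBytes) {
            return false;
        }
        ramDirBytesAlloc += estimatedBytes;
        return true;
    }

    /** On completion or failure, replaces the reservation with the bytes actually used (0 on failure). */
    synchronized void mergeFinished(long reservedBytes, long actualBytes) {
        ramDirBytesAlloc += actualBytes - reservedBytes;
    }

    synchronized long allocated() {
        return ramDirBytesAlloc;
    }

    public static void main(String[] args) {
        RamDirAccountingSketch policy = new RamDirAccountingSketch(100);
        policy.reserveForMerge(60);
        policy.mergeFinished(60, 45); // merged segment came out smaller than estimated
        System.out.println(policy.allocated());
    }
}
```

    Because the adjustment is a simple delta, it does not matter in which
    order concurrent merges finish.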

    {quote}One challenge we face is ensuring that while we are
    flushing all ram segments to disk, we don't block the
    getReader() turnaround. IE we can't make getReader() do that
    flush synchronously....perhaps we "merge RAM segments to disk" a
    bit early, eg once RAM consumed is > 90% of the total RAM
    buffer{quote}

    You're talking about the synchronization in IW.doFlushInternal
    which would block getReader while writing a segment to disk? Our
    default RAMPolicy should be one where we always flush the ram
    buffer to the ramdir. Basically there must always be room in the
    ram dir for the ram buffer, i.e. ramdir + (rambuf * 2) < maxSize. Or
    do we assume that it's ok for ramUsed to temporarily exceed
    ramMax by a given percentage (e.g. up to 110%, which would be an
    option in RAMPolicy) while ramBuf is being flushed to ramDir?
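
    The two alternatives can be written down as simple checks. This is a
    sketch only: the method names and the percent parameter are
    assumptions, and the factor of 2 is presumably there because the
    buffer and its flushed copy coexist for a moment during the flush.

```java
// Hypothetical sketch of the two headroom rules discussed above; not code from the patch.
final class FlushHeadroomSketch {
    /** Strict rule: ramdir + (rambuf * 2) < maxSize. */
    static boolean hasRoomForBufferFlush(long ramDirBytes, long ramBufBytes, long maxBytes) {
        return ramDirBytes + 2 * ramBufBytes < maxBytes;
    }

    /** Relaxed rule: total usage may temporarily reach allowedPercent of the maximum (e.g. 110). */
    static boolean withinOvershoot(long ramUsedBytes, long maxBytes, int allowedPercent) {
        return ramUsedBytes * 100 <= maxBytes * allowedPercent;
    }

    public static void main(String[] args) {
        System.out.println(hasRoomForBufferFlush(40, 20, 100));
        System.out.println(withinOvershoot(105, 100, 110));
    }
}
```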

    We may want to make some assumptions about usage of getReader
    (i.e. getReader is called fairly often such that the rambuffer
    is usually less than half of the ram used) when flushToRam=true
    so that we can get a version of this functionality out the door,
    then iterate as we gather feedback from users?

    I'll include the comments in the next patch.
  • Jason Rutherglen (JIRA) at May 29, 2009 at 3:50 pm
    [ https://issues.apache.org/jira/browse/LUCENE-1313?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12714491#action_12714491 ]

    Jason Rutherglen commented on LUCENE-1313:
    ------------------------------------------

    I had forgotten about concurrency with the docstores (keeping an
    open IndexInput and IndexOutput) when using an FSDirectory. I
    wrote a test (which fails) so getting this functionality to work
    could require some reworking of FSDirectory internals?
    (Something on the order of auto updating IndexInput's buffers
    and file length as IndexOutput is flushed?)

    We need simultaneous IndexInput and IndexOutput ops to work as
    docstore is streamed to FSDir for multiple segments. After a
    segment is flushed to the ramdir it can be read from, including
    the docstore which is still actively being written to (for the
    new segments)?
  • Michael McCandless (JIRA) at May 29, 2009 at 4:06 pm
    [ https://issues.apache.org/jira/browse/LUCENE-1313?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12714494#action_12714494 ]

    Michael McCandless commented on LUCENE-1313:
    --------------------------------------------

    bq. I had forgotten about concurrency with the docstores

    That's a big (and good) change; I think we should save that one for another issue, and leave this one focusing on flushing segments through a RAMDir?
  • Jason Rutherglen (JIRA) at May 29, 2009 at 5:32 pm
    [ https://issues.apache.org/jira/browse/LUCENE-1313?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12714522#action_12714522 ]

    Jason Rutherglen commented on LUCENE-1313:
    ------------------------------------------

    I guess I'm confused about the doc stores. We keep one open on
    disk for multiple segments being created via FSD, which
    means the IndexOutput isn't closed. However, for each new
    SegmentReader that's opened, we're creating a new IndexInput
    only after the segment is flushed (to FSD, docstore to disk).

    So the concurrent docstores may work without too much changing
    of FSDir internals, as the portion of the docstore file that the
    SR needs to know about has been flushed to disk when SR is
    opened. The SR should then be able to open the docstore file
    cleanly regardless of the open IndexOutput adding more to the
    file for the next rambuf segment?
  • Michael McCandless (JIRA) at May 29, 2009 at 6:02 pm
    [ https://issues.apache.org/jira/browse/LUCENE-1313?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12714535#action_12714535 ]

    Michael McCandless commented on LUCENE-1313:
    --------------------------------------------

    {quote}
    The SR should then be able to open the docstore file
    cleanly regardless of the open IndexOutput adding more to the
    file for the next rambuf segment?
    {quote}

    That's the crucial question: can we open a new IndexInput, while an IndexOutput is still writing to the file? I vaguely remember being surprised that this worked fine on Windows, but I'm not sure.

    If that's fine across all OS's, then, yes we could avoid closing the docStores when flushing a new segment.

    If it's not fine, then we'd need a way to make an IndexOutputInput, which is a bigger change.

    We also should [separately] consider having multiple SegmentReaders that share the same docStores, share a single set of IndexInputs (cloned). Ie if the RAM MergePolicy allows many segments in RAM at once, we are still opening real file descriptors to read the doc stores, so without such sharing we could start running out of descriptors.
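
    One way to probe the "reader opened while the writer is still open"
    question on a given OS is a small JDK-only experiment like the one
    below. It uses java.io.RandomAccessFile directly rather than Lucene's
    IndexInput/IndexOutput, so it only shows what the platform allows,
    not how FSDirectory would behave; the class and method names are made
    up for this sketch.

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;

// Hypothetical JDK-only probe; it does not use any Lucene classes.
final class ReadWhileWritingProbe {
    /**
     * Writes `first`, opens a second handle for reading while the writer stays open,
     * lets the writer append `second`, and returns what the reader sees of `first`.
     */
    static byte[] readFlushedPrefix(Path file, byte[] first, byte[] second) throws IOException {
        try (RandomAccessFile writer = new RandomAccessFile(file.toFile(), "rw")) {
            writer.write(first); // RandomAccessFile writes are not buffered in the JVM
            try (RandomAccessFile reader = new RandomAccessFile(file.toFile(), "r")) {
                writer.write(second); // the writer keeps appending after the reader is open
                byte[] seen = new byte[first.length];
                reader.readFully(seen);
                return seen;
            }
        }
    }

    public static void main(String[] args) throws IOException {
        Path tmp = Files.createTempFile("docstore-probe", ".bin");
        try {
            byte[] seen = readFlushedPrefix(tmp, new byte[] {1, 2, 3}, new byte[] {4, 5});
            System.out.println(Arrays.toString(seen));
        } finally {
            Files.deleteIfExists(tmp);
        }
    }
}
```

    If the call throws on some platform, that would suggest the
    IndexOutputInput route is needed there.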
  • Jason Rutherglen (JIRA) at May 31, 2009 at 8:25 pm
    [ https://issues.apache.org/jira/browse/LUCENE-1313?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12714924#action_12714924 ]

    Jason Rutherglen commented on LUCENE-1313:
    ------------------------------------------

    I want to run the Lucene unit tests in NRT mode without creating and/or
    modifying all the test cases. Instead of adding a system property
    that turns NRT on, have we settled on a different mechanism for
    global settings? Perhaps the back-compat type of system can be
    used here? Or, for now, a static variable on IW?

    {quote}can we open a new IndexInput, while an IndexOutput is
    still writing to the file?{quote}

    I successfully ran a test case on Windows that writes to a file
    while threads open it and read from the already flushed sections.
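
    For illustration, a check of this kind can be written with plain java.io, independent of Lucene. This is a minimal single-threaded sketch, not the test case referred to above: the class and method names are hypothetical, and RandomAccessFile stands in for IndexOutput/IndexInput. It opens a second handle for reading while the writer is still open, reads the bytes written so far, and then keeps writing.

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

public class ReadWhileWriting {

    /**
     * Writes a first chunk, reads it back through a separate handle while the
     * writer is still open, then appends a second chunk. Returns what the
     * reader saw.
     */
    static String readFirstChunkWhileWriting(Path file) throws IOException {
        byte[] first = "segment-1".getBytes(StandardCharsets.UTF_8);
        byte[] second = "segment-2".getBytes(StandardCharsets.UTF_8);

        try (RandomAccessFile writer = new RandomAccessFile(file.toFile(), "rw")) {
            // RandomAccessFile does no user-space buffering, so these bytes
            // are handed to the OS as soon as write() returns.
            writer.write(first);
            long writtenSoFar = writer.getFilePointer();

            String seen;
            try (RandomAccessFile reader = new RandomAccessFile(file.toFile(), "r")) {
                byte[] buf = new byte[(int) writtenSoFar];
                reader.readFully(buf);
                seen = new String(buf, StandardCharsets.UTF_8);
            }

            // The writer keeps appending after the reader has come and gone.
            writer.write(second);
            return seen;
        }
    }

    /** Runs the check against a temporary file and cleans up afterwards. */
    static String demo() {
        try {
            Path file = Files.createTempFile("docstore", ".bin");
            try {
                return readFirstChunkWhileWriting(file);
            } finally {
                Files.deleteIfExists(file);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

    A buffered writer would need an explicit flush before the reader is opened. A passing run on one OS also says nothing about the others, which is the open question above.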

    Closing docstores for every flush would seem to cause a lot of
    overhead. With NRT + FSD aren't termvector files merged on disk for
    every segment?

    {quote}We also should [separately] consider having multiple
    SegmentReaders that share the same docStores{quote}

    Doesn't FSDir open only one FD per file?


