FAQ
This has probably been asked before but I couldn't find it, so...

Is it possible / advisable / practical to use Lucene as the basis of a live
document search capability? By "live document" I mean a largish document
such as a word processor might be able to handle which is being edited
currently. Examples would be Word documents of some size that are being
written, really huge Java files, etc.

The user is sitting there typing away and of course everything is changing
in real time. This seems to be orthogonal to the idea of a Lucene index
which is costly to construct and costly to update.

TIA

  • Anshum at Dec 29, 2010 at 4:02 am
    Hi,
    Updating the index on every word or character typed is not practical with
    Lucene as things currently stand. There is something called real-time
    search, which lets you search an updated document, but it assumes updates
    arrive far less often than keystrokes.
    --
    Anshum Gupta
    http://ai-cafe.blogspot.com

  • Robert Muir at Dec 29, 2010 at 4:37 am

    On Tue, Dec 28, 2010 at 5:06 PM, software visualization wrote:
    > The user is sitting there typing away and of course everything is changing
    > in real time. This seems to be orthogonal to the idea of a Lucene index
    > which is costly to construct and costly to update.

    yes, but if they are typing away, they likely aren't also searching at
    the same time unless they have two keyboards and four hands... so why
    update anything in real time?
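
    One way to read this suggestion is to defer the work: mark the buffer
    dirty on every edit and rebuild the search structure only when a search
    is actually requested. The sketch below assumes a hypothetical
    `LazySearchBuffer` class and uses a plain word-to-offsets map as a
    stand-in for a real index; none of it is Lucene API.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: rebuilds its word index lazily, only when searched.
public class LazySearchBuffer {
    private static final Pattern WORD = Pattern.compile("\\w+");

    private final StringBuilder text = new StringBuilder();
    private final Map<String, List<Integer>> index = new HashMap<>();
    private boolean dirty = true;   // true when the index lags behind the text
    private int rebuilds = 0;       // counts rebuilds, for illustration only

    // Called on every keystroke or paste: cheap, no indexing work.
    public void insert(int offset, String s) {
        text.insert(offset, s);
        dirty = true;
    }

    // Called when the user actually searches: rebuild once if needed.
    public List<Integer> find(String word) {
        if (dirty) {
            rebuild();
        }
        return index.getOrDefault(word.toLowerCase(), new ArrayList<>());
    }

    public int rebuildCount() {
        return rebuilds;
    }

    // Re-scans the whole buffer and records the start offset of each word.
    private void rebuild() {
        index.clear();
        Matcher m = WORD.matcher(text);
        while (m.find()) {
            index.computeIfAbsent(m.group().toLowerCase(), k -> new ArrayList<>())
                 .add(m.start());
        }
        dirty = false;
        rebuilds++;
    }
}
```

    With this shape, any number of edits between two searches costs a
    single rebuild, which may be acceptable if a full re-scan of the buffer
    is fast enough for the document sizes involved.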

    ---------------------------------------------------------------------
    To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
    For additional commands, e-mail: java-user-help@lucene.apache.org
  • Umesh Prasad at Dec 29, 2010 at 5:10 am
    You can also look at Zoie and see if it fits your needs. It is a
    contribution by LinkedIn.

    http://snaprojects.jira.com/wiki/display/ZOIE/Home

    Also look at MemoryIndex. It is good for creating a single-document index
    and searching on it.

    http://lucene.apache.org/java/3_0_3/api/all/org/apache/lucene/index/memory/MemoryIndex.html


    Thanks
    Umesh
  • Anshum at Dec 29, 2010 at 5:25 am
    Hi Umesh,
    I'm not really confident that Zoie or anything built on the current version
    of Lucene would be able to handle a search-as-you-type kind of setup.

    --
    Anshum Gupta
    http://ai-cafe.blogspot.com

  • Shashi Kant at Dec 29, 2010 at 7:01 am

    > yes, but if they are typing away, they likely aren't also searching at
    > the same time unless they have two keyboards and four hands... so why
    > update anything in real time?

    Presumably the OP meant that user A was editing the doc while other users,
    or a monitoring app, are searching said doc simultaneously.
    Is that along the right track?

    BTW, have you looked at profanity filters, spellcheckers etc., which
    might be more suited to what you are looking for?

  • Sean at Dec 29, 2010 at 7:36 am
    Does it make any sense?
    Every time a search result is shown, the original document could have been changed, no matter how fast the indexing speed is.
    If you can accept this inconsistency, you do not need to index so frequently at all.


  • Adam Saltiel at Dec 29, 2010 at 3:38 pm
    This is interesting. What are we driving at here? A single user? That doesn't make sense to me unless you want to flag certain things as they construct the document. Or else why don't they know what is in their own document? There must be other ways apart from Lucene. It seems to me you want each line parsed as soon as it is entered and matched against some criteria. I would look at plugins for Open Office first, or any other text editor. But I'm not sure you have given enough information.
  • Software visualization at Dec 29, 2010 at 4:55 pm
    I am writing a text editor and have to provide a certain search
    functionality.

    The use case is for a single user. A single document is potentially very
    large, and numerous such documents may be open and unflushed at any given
    time. Think many files of an IDE, except the files are larger. The user is
    free to change, say, variable names across documents, which may be separate
    files opened simultaneously in a variety of tabs (say) and being edited
    with no guarantee that the user has flushed or saved any of it.
  • Adam Saltiel at Dec 29, 2010 at 5:16 pm
    What has this to do with Lucene? You're thinking its index would be faster than your own search algorithm. Would it, though? Do you really need an index, or a good pattern matcher? I can't see what the stream buffer being flushed by the user has to do with it. Don't you have to control that behaviour?

  • Lance Norskog at Dec 30, 2010 at 2:30 am
    Check out the Instantiated contrib for Lucene. This is an alternative
    in-memory data structure that does not need commits and is faster (and
    larger) than the Lucene Directory system.
    --
    Lance Norskog
    goksron@gmail.com

  • Grant Ingersoll at Jan 3, 2011 at 3:17 pm
    There is also the MemoryIndex, which is in contrib and is designed for one document at a time. That being said, basic grep/regex is probably fast enough.

    -Grant
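
    As a rough illustration of the grep/regex route, the sketch below scans
    the in-memory text of each open buffer with `java.util.regex` and
    reports match offsets. The `BufferSearch` class, the `findAll` method
    and the map of tab names to text are assumptions made for illustration;
    they do not come from the thread or from Lucene.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: literal, case-insensitive search over unsaved buffers.
public class BufferSearch {

    // Returns, for each buffer name with at least one match,
    // the start offsets of every match in that buffer.
    public static Map<String, List<Integer>> findAll(Map<String, String> buffers,
                                                     String needle) {
        // Pattern.quote treats the needle as literal text, not as a regex.
        Pattern p = Pattern.compile(Pattern.quote(needle), Pattern.CASE_INSENSITIVE);
        Map<String, List<Integer>> hits = new LinkedHashMap<>();
        for (Map.Entry<String, String> e : buffers.entrySet()) {
            List<Integer> offsets = new ArrayList<>();
            Matcher m = p.matcher(e.getValue());
            while (m.find()) {
                offsets.add(m.start());
            }
            if (!offsets.isEmpty()) {
                hits.put(e.getKey(), offsets);
            }
        }
        return hits;
    }
}
```

    Because the scan always runs against the current text, there is no
    index to go stale; the trade-off is that every search costs a full pass
    over each open buffer, and there is no stemming or analysis.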
    --------------------------
    Grant Ingersoll
    http://www.lucidimagination.com


  • Robert Muir at Jan 3, 2011 at 3:31 pm

    On Mon, Jan 3, 2011 at 10:16 AM, Grant Ingersoll wrote:
    > There is also the MemoryIndex, which is in contrib and is designed for one
    > document at a time. That being said, basic grep/regex is probably fast enough.

    In cases where you are doing a 'find' in a document similar to what a
    word processor would do (especially if you want to iterate
    forwards/backwards through matches etc.), you might want to consider
    something like http://icu-project.org/apiref/icu4j/com/ibm/icu/text/StringSearch.html

  • Adasal at Jan 4, 2011 at 6:26 pm
    I would think this is more like it.
    But the essential thing, it seems to me, is whether there is a
    requirement for a serialised index, i.e. a more permanent record, aside from
    the saved document.
    If so, then any penalty for creating the index, compared to regex,
    StringSearch or similar, is justified on other grounds.
    I think it is an interesting question: when does that requirement emerge?
    There is the size of the document.
    But there would also be field types. I think I have this right. This is
    really a classification system, so more than bare regex.
    There must be other criteria that apply to this use case, too?

    Adam

    p.s. we (in my work project) are just beginning to use Lucene for geometry
    objects and I am looking forward to understanding its use better, including,
    possibly, expanding it to other use cases apart from geo objects.
  • Software visualization at Jan 21, 2011 at 7:59 pm
    Hi, sorry for the long delay.

    The idea is that a single user is editing a single document. As they edit,
    any indexes built against the document become stale, actually wrong.
    Example: references to specific locations within this document are all
    instantly wrong the first time a user types a new character at the
    beginning; they're all off by one. Deleting words is of course disastrous,
    etc. So our story is: we used to have this document nicely indexed and now
    we have nothing useful.

    Considering what Lucene does prior to indexing, stemming for instance, I am
    not sure I can recreate the same powerful indexing functionality; in fact,
    I am quite sure I can't.

    But it seems wrong to lure our users into opening this document with
    promises that this, that and the other thing has been located for them,
    only to invalidate all that just because they began to edit the document. I
    understand why that happens, but my users are perhaps not as tech savvy and
    I think it will just feel "wrong" to them.

    So I am looking for a way around this.


  • Umesh Prasad at Jan 22, 2011 at 2:12 am
    Hi,
    One workaround would be to version the documents and store the
    version as well as the timestamp of the indexed document in the index.

    Reading between the lines, I assume that the document is
    a) stored in some DB/file
    b) indexed in a Lucene index

    The user searches on b) and gets document IDs, but documents are
    displayed to the user after retrieving them from a).

    Now, I do not know a way in which I can keep a) and b) completely in
    sync in real time, as the indexing operation itself, a) --> b), will
    take some time.

    Instead we can do the following.
    a) Stored: Document ID + Document Text + Document Version +
    Modification Time Stamp (T1)
    b) Indexed: Document ID + Document Text + Document Version +
    Modification Time Stamp (T2) (when indexed) (broken into date + hour +
    mins + sec to minimize the number of terms)

    The user searches b).
    The search system gets Document ID + Modification Time Stamp (T2) and gives
    them to the presentation layer, which compares T1 and T2.
    If T2 < T1, Skip the result.

    Assumption: the stored document is always in sync. Documents are
    persisted somewhere and not served from memory.

    Thanks & Regards
    Umesh Prasad
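
    The comparison step described above could look roughly like the
    following. `Hit`, `StaleHitFilter` and the map of stored modification
    times are hypothetical names used only for this sketch; how T1 and T2
    are actually stored and retrieved is left open.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical presentation-layer filter: drops hits whose index entry (T2)
// is older than the stored document's last modification (T1).
public class StaleHitFilter {

    // A search hit as returned from the index: document ID plus T2.
    public static final class Hit {
        final String docId;
        final long indexedAtMillis;   // T2

        public Hit(String docId, long indexedAtMillis) {
            this.docId = docId;
            this.indexedAtMillis = indexedAtMillis;
        }
    }

    // storedModifiedAt maps document ID to T1 from the document store.
    public static List<String> freshDocIds(List<Hit> hits,
                                           Map<String, Long> storedModifiedAt) {
        List<String> fresh = new ArrayList<>();
        for (Hit hit : hits) {
            Long t1 = storedModifiedAt.get(hit.docId);
            // Skip unknown documents and any hit where T2 < T1.
            if (t1 != null && hit.indexedAtMillis >= t1) {
                fresh.add(hit.docId);
            }
        }
        return fresh;
    }
}
```

    Note that this only hides stale results; it relies on the stated
    assumption that the stored copy is persisted and up to date.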



    On Sat, Jan 22, 2011 at 1:29 AM, software visualization
    wrote:
    Hi sorry for the long delay.

    The idea is that a single user is editing a single document. As they edit,
    any indexes built against the document become stale, actually wrong.
    Example:  references to specific localities within this document are all
    instantly wrong the first time a user types a new beginning  character-
    they're all off by one. Deleting  words is of course disastrous etc. etc.
    So our story is- we used to have this document nicely indexed and now we
    have nothing useful.

    Considering what Lucene does prior to indexing, stemming for instance,  I am
    not sure no, I am quite sure I can't  recreate the same powerful indexing
    functionality.

    But it seems wrong  to lure our users into opening this document with
    promises that this that and the other thing is has been located for them
    only to invalidate all that just because they began to edit the document. I
    understand why that happens , but my users are perhaps not as tech savvy and
    I think it will just feel "wrong" to them.

    So I am looking for a way around this.


    On Tue, Jan 4, 2011 at 1:25 PM, adasal wrote:

    I would think this is more like it.
    But the essential thing, so it seems to me, is whether there is a
    requirement for a serialised index, i.e. a more permanent record, aside
    from
    the saved document.
    Then, if there is a penalty to creating the index compared to regex,
    stringsearch or so, it is justified on other grounds.
    I think it is an interesting q. when does that requirement emerge?
    There is size of document.
    But there would also be field types. I think I have this right. This is
    really a classification system, so more than bare regex.
    There must be other criteria that apply to this use case, too?

    Adam

    p.s. we (in my work project) are just beginning to use Lucene for geometry
    objects and I am looking forward to understanding its use better,
    including,
    possibly, expanding it to other use cases apart from geo objects.
    On 3 January 2011 15:31, Robert Muir wrote:

    On Mon, Jan 3, 2011 at 10:16 AM, Grant Ingersoll <gsingers@apache.org>
    wrote:
    There is also the MemoryIndex, which is in contrib and is designed for
    one document at a time. That being said, basic grep/regex is probably fast
    enough.
    In cases where you are doing a 'find' in a document similar to what a
    word processor would do (especially if you want to iterate
    forwards/backwards through matches etc.), you might want to consider
    something like
    http://icu-project.org/apiref/icu4j/com/ibm/icu/text/StringSearch.html
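
The plain "find" suggested above can be sketched with nothing but `java.util.regex`, run directly against the editor's current text so there is no index to go stale. This is a minimal sketch, not ICU's `StringSearch` and not Lucene: the class `LiveFind` and its methods are hypothetical names, and matching is literal and case-insensitive with no stemming or analysis.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper: literal, case-insensitive find over the text being edited. */
final class LiveFind {
    private final Pattern pattern;

    LiveFind(String needle) {
        // Pattern.quote makes the needle a literal string rather than a regex.
        this.pattern = Pattern.compile(Pattern.quote(needle),
                Pattern.CASE_INSENSITIVE | Pattern.UNICODE_CASE);
    }

    /** Start offset of the first match at or after 'from', or -1 if there is none. */
    int next(CharSequence text, int from) {
        if (from < 0 || from > text.length()) {
            return -1;
        }
        Matcher m = pattern.matcher(text);
        return m.find(from) ? m.start() : -1;
    }

    /** Start offset of the last match that begins before 'from', or -1 if there is none. */
    int previous(CharSequence text, int from) {
        Matcher m = pattern.matcher(text);
        int last = -1;
        while (m.find() && m.start() < from) {
            last = m.start();
        }
        return last;
    }

    public static void main(String[] args) {
        StringBuilder doc = new StringBuilder("Lucene index; lucene search");
        LiveFind find = new LiveFind("lucene");
        System.out.println(find.next(doc, 1));
        // Simulate the user typing one character at the very beginning.
        doc.insert(0, "x");
        // The search runs against the live buffer, so offsets are never off by one.
        System.out.println(find.next(doc, 1));
        System.out.println(find.previous(doc, doc.length()));
    }
}
```

Because every call scans the current buffer, an edit at the start of the document simply shifts the offsets that the next search reports. The cost is a linear scan per search, which is the trade-off the "probably fast enough" remark refers to.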

    ---------------------------------------------------------------------
    To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
    For additional commands, e-mail: java-user-help@lucene.apache.org


    --
    ---
    Thanks & Regards
    Umesh Prasad

  • Software visualization at Jan 22, 2011 at 5:29 am
    If I understand you correctly, I think that this:

    If T2 < T1, Skip the result.

    will always be the case. The live, being-edited document is always "later"
    in time than the indexed information about it.


    On Fri, Jan 21, 2011 at 9:11 PM, Umesh Prasad wrote:

    Hi,
    One workaround would be to version the documents and store the
    version as well as the timestamp of the indexed document in the index.

    Reading between the lines, I assume that the document is
    a) stored in some DB/file
    b) indexed in a Lucene index

    The user searches on b) and gets document ids,
    but documents are displayed to the user after retrieving them from a).

    Now I do not know a way in which I can keep a) and b) completely in
    sync in real time, as there will be some time taken by the indexing
    operation itself (a) --> b)).

    Instead we can do the following.
    a) Stored: Document ID + Document Text + Document Version +
    Modification Time Stamp (T1)
    b) Indexed: Document ID + Document Text + Document Version +
    Modification Time Stamp (T2) (when indexed) (broken into date + hour +
    mins + sec to minimize the number of terms)

    The user searches b).
    The search system gets Document ID + Modification Time Stamp (T2) and
    gives them to the presentation layer, which compares T1 and T2.
    If T2 < T1, Skip the result.

    Assumption: the stored document is always in sync. Documents are
    persisted somewhere and not served from memory.

    Thanks & Regards
    Umesh Prasad
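
The T1/T2 comparison described above might look roughly like this in the presentation layer. It is only a sketch under stated assumptions: `StaleHitFilter`, `Hit` and the `storedModifiedAt` map are hypothetical names, the map stands in for the DB/file store, and timestamps are plain `long` values rather than the split date/hour/min/sec terms mentioned in the post.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

/** Hypothetical presentation-layer filter for the T1/T2 scheme. */
final class StaleHitFilter {

    /** A search hit: document id plus T2, the modification time stored in the index. */
    static final class Hit {
        final String docId;
        final long indexedModifiedAt; // T2

        Hit(String docId, long indexedModifiedAt) {
            this.docId = docId;
            this.indexedModifiedAt = indexedModifiedAt;
        }
    }

    /**
     * Keeps only hits whose indexed timestamp is not older than the stored one.
     * storedModifiedAt maps document id to T1 (assumption: it mirrors the store).
     */
    static List<Hit> dropStale(List<Hit> hits, Map<String, Long> storedModifiedAt) {
        List<Hit> fresh = new ArrayList<>();
        for (Hit hit : hits) {
            Long t1 = storedModifiedAt.get(hit.docId);
            // "If T2 < T1, Skip the result." Hits whose document is missing
            // from the store are skipped as well.
            if (t1 != null && hit.indexedModifiedAt >= t1) {
                fresh.add(hit);
            }
        }
        return fresh;
    }
}
```

The filter only hides stale hits; it does not make them fresh. For a document that is changing on every keystroke, T1 moves ahead of T2 almost immediately, so nearly every hit would be dropped until the next reindex.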



  • Lance Norskog at Jan 22, 2011 at 7:22 am
    There's a feature in Lucene called an "instantiated" index. This has
    all of the Lucene data structures directly as objects instead of being
    serialized to disk or a RAMDirectory. It never needs to be committed:
    you index a document and it is immediately searchable. It is larger
    and faster than a normal index, but might be the right thing for this
    use case. You cannot store it to disk; it only lives in memory.

    --
    Lance Norskog
    goksron@gmail.com

  • Umesh Prasad at Jan 22, 2011 at 7:25 am
    Nope, that won't always be the case. Users will not always be editing
    the document. They will edit the document, then save, and the saved
    version is persisted in the DB. You can use DB triggers to push it into an
    indexing queue, from which the indexer can regularly pick up the document
    for indexing. You can schedule your indexer so that it picks up the
    indexing jobs every minute or so.
    Unless you have a mission-critical system, this approach should be
    more than sufficient.
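
A rough sketch of the queue-plus-scheduled-indexer idea, using only `java.util.concurrent`. The names here (`IndexingQueue`, `documentSaved`, `runOnce`) are hypothetical; the `Consumer` passed in stands in for whatever actually reindexes a document, and the one-second delay in `main` is only there so the demo finishes quickly, where the post suggests a minute or so.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.Executors;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.function.Consumer;

/** Hypothetical indexing queue: saves enqueue a document id, a timer drains the queue. */
final class IndexingQueue {
    private final BlockingQueue<String> pending = new LinkedBlockingQueue<>();
    private final Consumer<String> indexer; // stand-in for the real reindexing step

    IndexingQueue(Consumer<String> indexer) {
        this.indexer = indexer;
    }

    /** Called whenever a document is saved (the post suggests a DB trigger). */
    void documentSaved(String docId) {
        pending.add(docId);
    }

    /** One indexer pass: drain the queue, reindex each document once. Returns the count. */
    int runOnce() {
        List<String> batch = new ArrayList<>();
        pending.drainTo(batch);
        // Several saves of the same document between passes need only one reindex.
        Set<String> unique = new LinkedHashSet<>(batch);
        for (String docId : unique) {
            indexer.accept(docId);
        }
        return unique.size();
    }

    public static void main(String[] args) throws InterruptedException {
        IndexingQueue queue = new IndexingQueue(id -> System.out.println("reindex " + id));
        ScheduledExecutorService timer = Executors.newSingleThreadScheduledExecutor();
        timer.scheduleWithFixedDelay(queue::runOnce, 0, 1, TimeUnit.SECONDS);
        queue.documentSaved("doc-1");
        queue.documentSaved("doc-1");
        queue.documentSaved("doc-2");
        Thread.sleep(1500);
        timer.shutdown();
    }
}
```

The index lags the store by at most one scheduling interval plus the indexing time, which is the delay this approach accepts.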



    --
    ---
    Thanks & Regards
    Umesh Prasad

  • Software visualization at Jan 22, 2011 at 3:55 pm
    Lance, Umesh, thank you.

    Lance, I will look into this and report results when I try it out. Thank you
    very much!

    Umesh: Just thinking along these lines, when a user saves the document,
    that event may have a semantic meaning that developers aren't privy to.
    The user might be experimenting with the document in some way as a means of
    gaining useful knowledge but specifically and deliberately NOT saving it.
    Users surprise you, no?

    For this reason, I am not inclined to create a dependency between search and
    save; they should be able to search at any time irrespective of whether
    they've saved.

    But what's to stop us from hooking this up to a timer, say, and indexing it
    every so often? It's not perfect, since indexing = I/O = possibly
    noticeable delays (but the CPU can carry on doing useful things of course).

    The only problem in this scenario, besides the not-really-real-time aspect of
    it, is if the user decides, again for their own good reasons, not to save
    but rather to back out of the changes they have in memory; is the index now
    "ahead" of the document? I suppose in that case I blow away the existing
    index and rebuild it from the version of the document the user did decide
    to save, and there's no discontinuity.



    Just thinking aloud now for anyone interested in the same problem.
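
For anyone following the same line of thought, the timer idea could be sketched as a dirty flag plus a periodic rebuild from the in-memory text, independent of saving. Everything here is an assumption for illustration: `LiveDocumentIndex` is a hypothetical class, and the term-to-offsets map is a toy stand-in for a real Lucene index (no stemming, no analysis).

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical timer-driven index over the text currently in the editor. */
final class LiveDocumentIndex {
    private static final Pattern WORD = Pattern.compile("\\w+");

    private String text = "";
    private boolean dirty = false;
    private Map<String, List<Integer>> offsetsByTerm = new HashMap<>();

    /** Called by the editor on every change; cheap, it only records the new text. */
    synchronized void onEdit(String currentText) {
        text = currentText;
        dirty = true;
    }

    /** Called when the user backs out of unsaved changes: rebuild from the saved text now. */
    synchronized void revertTo(String savedText) {
        text = savedText;
        dirty = true;
        refresh();
    }

    /** Called from a timer; rebuilds only if the text changed. Returns true if it rebuilt. */
    synchronized boolean refresh() {
        if (!dirty) {
            return false;
        }
        Map<String, List<Integer>> rebuilt = new HashMap<>();
        Matcher m = WORD.matcher(text);
        while (m.find()) {
            String term = m.group().toLowerCase(Locale.ROOT);
            rebuilt.computeIfAbsent(term, k -> new ArrayList<>()).add(m.start());
        }
        offsetsByTerm = rebuilt; // throw the old index away wholesale
        dirty = false;
        return true;
    }

    /** Offsets of a term as of the last refresh; may lag behind the live text. */
    synchronized List<Integer> offsetsOf(String term) {
        List<Integer> hits = offsetsByTerm.get(term.toLowerCase(Locale.ROOT));
        return hits == null ? new ArrayList<>() : new ArrayList<>(hits);
    }
}
```

Search results reflect the text as of the last `refresh()`, so the timer interval bounds how stale they can be. Backing out of changes goes through the same rebuild path, so the index cannot stay "ahead" of the document.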
  • Sean at Dec 29, 2010 at 7:44 am
    Does it make any sense?
    Every time a search result is shown, the original document could have been changed, no matter how fast the indexing speed is.
    If you can accept this inconsistency, you do not need to index so frequently at all.


    ------------------ Original ------------------
    From: "software visualization"<softwarevisualization@gmail.com>;
    Date: Wed, Dec 29, 2010 06:06 AM
    To: "java-user"<java-user@lucene.apache.org>;

    Subject: Using Lucene to search live, being-edited documents



Discussion Overview
group: java-user
categories: lucene
posted: Dec 28, '10 at 10:07p
active: Jan 22, '11 at 3:55p
posts: 21
users: 9
website: lucene.apache.org
