I've used Lucene for a long time, but only in the most basic way. I
have a custom analyzer and a slightly hacked query parser, but in
general it's the basic add document/remove document/query documents
cycle.

In my system, I'm indexing a store of external documents, maintaining
an index for full-text querying. However, I might be turned off when
documents are added, and then when I'm restarted, I'm going to need to
determine the timestamp of the last document added to the index so that
I can pick up where I left off.

There are three approaches to doing this, two using Lucene. I don't
know how I would do the two Lucene approaches, or even if they're
possible.

1. Just keep a file in parallel with the index, reading and writing the
timestamp of the last indexed document in it. I know how to do this,
but I don't like the idea of keeping a separate file.

2. Drop a timestamp onto each document as it's indexed. I've attached
timestamp fields to documents in the past so that I could do range
queries on them. However, I don't know how to do a query like "the
document with the latest timestamp" or even if that's possible.

3. Create a dummy document (with some unique field identifier so you
could quickly query for it) with a field "last timestamp". This is a
"global value storage" approach, as you could just store any field with
any value on it. But I'd be updating this timestamp field a lot, which
means that every time I updated the index I'd have to remove this
special document and reindex it. Is there any way to update the value
of a field in a document directly in the index without removing and
adding it again to the index? The field I'd want to update would just
be stored, not indexed or tokenized.

Thanks for your help in guiding my exploration into the capabilities of
Lucene.

Avi

--
Avi 'rlwimi' Drissman
[email protected]
Argh! This darn mail server is trunca


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]


  • Claes Holmerson at Aug 25, 2004 at 2:40 pm

    Avi Drissman wrote:

    1. Just keep a file in parallel with the index, reading and writing
    the timestamp of the last indexed document in it. I know how to do
    this, but I don't like the idea of keeping a separate file.

    This is similar to the approach I chose: I used a properties file,
    kept in the index directory, and stored certain data in it. I didn't
    like the idea at first either, but later I thought, why not? It is the
    simplest way. As long as the file name is not one used by Lucene, it
    should be safe.

    Claes
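
A minimal sketch of the separate-file approach described above, assuming a modern Java runtime rather than the JDK of 2004. The class name `CheckpointFile`, the file name `indexer-checkpoint.properties`, and the key `lastIndexedTimestamp` are all hypothetical; the only requirement taken from the thread is that the file name must not collide with a file Lucene creates.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.Properties;

class CheckpointFile {
    // Hypothetical names; pick anything Lucene itself does not use.
    static final String FILE_NAME = "indexer-checkpoint.properties";
    static final String KEY = "lastIndexedTimestamp";

    /** Returns the saved timestamp in milliseconds, or -1 if no checkpoint exists. */
    static long load(Path indexDir) throws IOException {
        Path file = indexDir.resolve(FILE_NAME);
        if (!Files.exists(file)) {
            return -1L;
        }
        Properties props = new Properties();
        try (InputStream in = Files.newInputStream(file)) {
            props.load(in);
        }
        return Long.parseLong(props.getProperty(KEY, "-1"));
    }

    /** Writes the timestamp to a temporary file, then moves it over the old checkpoint. */
    static void save(Path indexDir, long timestampMillis) throws IOException {
        Properties props = new Properties();
        props.setProperty(KEY, Long.toString(timestampMillis));
        Path tmp = indexDir.resolve(FILE_NAME + ".tmp");
        try (OutputStream out = Files.newOutputStream(tmp)) {
            props.store(out, "timestamp of the last indexed document");
        }
        Files.move(tmp, indexDir.resolve(FILE_NAME), StandardCopyOption.REPLACE_EXISTING);
    }
}
```

On restart, `load` returns -1 when no checkpoint has been written, so the indexer would start from the beginning; after each successful batch it would call `save` with the timestamp of the last document indexed. Reading the value back is a single small file read, independent of index size.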


  • Otis Gospodnetic at Aug 25, 2004 at 3:25 pm
    What if all Documents in your index contained a flag field plus an
    'add date' field? Then you could run a query such as flag:1, sort it
    by the 'add date' field in descending order, and take only the very
    first hit as the most recently added Document.

    Otis

  • Bernhard Messer at Aug 25, 2004 at 3:40 pm
    Avi,

    I would prefer the second approach. If you already store the date and
    time when the doc was indexed, you could use the following trick to get
    the last document added to the index:

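    import org.apache.lucene.document.Document;
    import org.apache.lucene.index.IndexReader;

    IndexReader ir = IndexReader.open("/tmp/testindex");

    // Document numbers run from 0 to maxDoc() - 1, so scan backward
    // down to and including 0 for the newest non-deleted document.
    int maxDoc = ir.maxDoc();
    while (--maxDoc >= 0) {
        if (!ir.isDeleted(maxDoc)) {
            Document doc = ir.document(maxDoc);
            // get() returns the stored string value of the field
            System.out.println(doc.get("indexDate"));
            break;
        }
    }
    ir.close();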

    What do you think about this implementation? No extra properties,
    nothing to worry about. All the information is within your index.

    regards
    Bernhard

  • Avi Drissman at Aug 25, 2004 at 3:50 pm

    On Aug 25, 2004, at 11:39 AM, Bernhard Messer wrote:

    If you already store the date time when the doc was index, you could
    use the following trick to get the last document added to the index:

    while (--maxDoc > 0) {

    Yes, but that's a linear search :(

    On Aug 25, 2004, at 11:25 AM, Otis Gospodnetic wrote:

    What if all Documents in your index contained some flag field + an 'add
    date' field. Then you could make a query such as: flag:1 and sort it
    by 'add date' field, taking only the very first hit as the most
    recently added Document.

    That's a very clever approach. I'm currently using Lucene 1.3, so I
    hadn't thought about using the new sorting abilities. I'd need to move
    to 1.4, of course.

    A question, though: how efficient is it to make a query that matches
    all documents and then sort it? I'm looking for something as small as I
    can; after all, storing the last date in a file separate from the index
    is O(1)...

    Thanks!

    Avi

  • Otis Gospodnetic at Aug 25, 2004 at 3:54 pm
    The more documents match, the slower the search; how long your
    particular search would take I cannot tell, though - you should just
    test it out and see.

    I never needed to use the trick with a flag field in all documents, but
    I know others do it.

    Otis

  • Grant Ingersoll at Aug 25, 2004 at 3:57 pm

    Avi Drissman wrote (8/25/2004 11:50:01 AM):

    On Aug 25, 2004, at 11:39 AM, Bernhard Messer wrote:

    while (--maxDoc > 0) {

    Yes, but that's a linear search :(

    You are right, in the worst case, this would be linear, but that would
    require you to have deleted a lot of documents. I would bet that on
    average, arguably in nearly all cases, you would go through very few
    iterations before finding the doc you are interested in.

  • Avi Drissman at Aug 25, 2004 at 4:02 pm

    On Aug 25, 2004, at 11:57 AM, Grant Ingersoll wrote:

    You are right, in the worst case, this would be linear,

    No, in _all_ cases this would be linear.

    I would bet, that on average,
    arguably nearly all cases, you would go through very few iterations
    before finding the doc you are interested in

    Then you don't understand what I'm trying to do. I'm trying to find the
    document with the biggest value for the field. That would involve
    checking the field's value in every document to ensure this.

    Avi

  • Grant Ingersoll at Aug 25, 2004 at 4:26 pm
    Avi,

    I may be confused. As I understand it, you said you were interested in
    the last document indexed, and Bernhard's code does that. Lucene adds
    documents sequentially, so counting backward from maxDoc() should get
    you the last indexed document pretty quickly. Only if all documents
    were deleted would this go through all of them. Otherwise it doesn't
    have to traverse all of the documents; it just has to find the "first"
    document that is not deleted (since we are starting at the end of the
    list and going backward).
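
The difference between the two readings of the problem can be illustrated with a small model in plain Java. It does not use Lucene: the arrays stand in for document numbers, deletion flags, and one stored timestamp per document, and all names are hypothetical. The backward scan stops at the first live document, while finding the largest timestamp has to look at every live document.

```java
class LastDocModel {
    /** Scans backward from the highest document number; returns the first live one, or -1. */
    static int lastLiveDoc(boolean[] deleted) {
        for (int doc = deleted.length - 1; doc >= 0; doc--) {
            if (!deleted[doc]) {
                return doc;
            }
        }
        return -1;
    }

    /** Checks every live document; returns the largest timestamp, or -1 if none are live. */
    static long maxTimestamp(long[] timestamps, boolean[] deleted) {
        long max = -1L;
        for (int doc = 0; doc < timestamps.length; doc++) {
            if (!deleted[doc] && timestamps[doc] > max) {
                max = timestamps[doc];
            }
        }
        return max;
    }

    public static void main(String[] args) {
        // Documents were not added in timestamp order, and the newest slot is deleted.
        long[] timestamps = {100L, 400L, 200L, 300L};
        boolean[] deleted = {false, false, false, true};

        int last = lastLiveDoc(deleted);
        System.out.println("last live doc: " + last + ", timestamp " + timestamps[last]);
        System.out.println("max timestamp: " + maxTimestamp(timestamps, deleted));
    }
}
```

In this model the last live document is number 2 with timestamp 200, while the largest timestamp is 400. The two answers agree only when documents are always added in increasing timestamp order, which is the condition the backward scan relies on.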
  • Avi Drissman at Aug 25, 2004 at 4:38 pm

    On Aug 25, 2004, at 12:25 PM, Grant Ingersoll wrote:

    I may be confused, as I understand it you said you were interested in
    the last document indexed,

    Yes, I see what you meant. I'm sorry.

    That's actually an interesting option. Is getting the timestamp of the
    last document indexed a good enough solution or must I find the latest
    timestamp of all indexed documents? I'd have to ponder that for a
    while.

    Avi


Discussion Overview
group: java-user
category: lucene
posted: Aug 25, '04 at 2:19p
active: Aug 25, '04 at 4:38p
posts: 10
users: 5
website: lucene.apache.org
