I have a requirement to return only one result for all documents whose
timestamps fall within N seconds of one another (where timestamp is a
field and N is an integer).

For example, if Document A is timestamped "12:00:00" and Document B has
timestamp "12:00:30", Document B should be discarded. On the other
hand, if Document B has timestamp "12:01:00" then I should return both
(assuming 30 < N < 59 seconds).

Similarly, if Documents A, B, and C have timestamps "12:00:00",
"12:00:30", and "12:01:00" respectively, only Document A should be
returned (because B is close to A, and C is close to B).

If it helps to simplify things, we can assume results are sorted by
time. Also, I can apply logic at index time or at search time.

Any suggestions? This is a pretty tough concept to search the archives
for...

--Ben


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org

  • Karl wettin at May 23, 2006 at 10:44 pm

    How big is the corpus, and how many hits do you estimate a search can
    return? Can you simply accept the cost of iterating over the hits?


  • Benjamin Stein at May 23, 2006 at 11:28 pm

    The corpus is very big. Approximately 300,000,000 documents and
    growing. I would estimate potentially a huge number of hits per search.

    We currently do iterate through the hits and process them like you
    suggest, but that requires some impressive kludges to work :) Just
    wondering if there was a clever way to push this logic into the
    index/search process.

    My other plan was to create a class that implements the Searchable
    interface. This class would just forward all search requests to a
    private IndexSearcher data member and post-process the results before
    returning them. I would then pass an array of these customized searchers
    to a ParallelMultiSearcher. Given enough parallel processing, this might
    work in a reasonable timeframe.
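
The post-processing step described above could look roughly like the following. This is only a minimal sketch under stated assumptions, not code from the thread and not part of Lucene: `TimeWindowCollapser`, `Hit`, and `collapse` are hypothetical names, timestamps are assumed to be plain seconds, hits are assumed to arrive sorted by time, and a gap of exactly N seconds is assumed to count as "within N".

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper, not part of Lucene: collapses time-sorted hits. */
final class TimeWindowCollapser {

    /** Minimal stand-in for a search hit: a document id plus its timestamp in seconds. */
    static final class Hit {
        final int doc;
        final long timestampSeconds;

        Hit(int doc, long timestampSeconds) {
            this.doc = doc;
            this.timestampSeconds = timestampSeconds;
        }
    }

    /**
     * Keeps only the first hit of every chain in which each hit lies within
     * n seconds of the hit before it. The input must be sorted by time, ascending.
     */
    static List<Hit> collapse(List<Hit> sortedHits, long n) {
        List<Hit> kept = new ArrayList<>();
        Hit previous = null;
        for (Hit hit : sortedHits) {
            // Compare against the previous hit (kept or discarded), so that a
            // chain such as A -> B -> C collapses to A alone.
            if (previous == null || hit.timestampSeconds - previous.timestampSeconds > n) {
                kept.add(hit);
            }
            previous = hit;
        }
        return kept;
    }
}
```

Because each hit is compared with its immediate predecessor rather than with the last hit that was kept, the A, B, C case from the original question yields only A. The pass is linear in the number of hits, so the cost is dominated by retrieving and sorting them.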



  • Chris Hostetter at May 24, 2006 at 1:49 am
    A pretty big variable here in trying to find a "clever" solution to your
    problem is: how many results do you want?

    Do you need all of them for some sort of downstream processing, or are you
    only interested in the first M? ... how big is M?

    Assuming M is something manageable, I would try writing a HitCollector
    that maintains a bounded, sorted list of (doc, date) pairs (sorted
    by date).

    When you collect a new match X, you scan the list looking for any item
    I such that X.date-N <= I.date && I.date <= X.date+N. For all the items
    you find that meet that criterion (they should all be in a clump, since
    the list is sorted), remove all but the one with the lowest date, and then
    either replace that one with X, or throw away X if it's not the lowest
    date (i.e., collecting B after A and C in your A, B, C example).

    One thing to watch out for: make sure your bounded, sorted list is
    bounded by M+1, and then throw away the last item when you are done.
    If you limit it to M items you might fill up and start ignoring items
    outside of the range of the list, and then the last doc you collect might
    be like "B" and cause two items to be removed, leaving you with one less
    result than you wanted.
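
The clump-merging idea above can be sketched as follows. This is a variant, not a literal transcription of the description: instead of single (doc, date) pairs it stores, for each clump, the earliest and latest date seen, so that chains are still merged when a discarded hit was the only link between two others. It is independent of Lucene, the names `ClusteringCollector`, `collect`, and `results` are hypothetical, dates are assumed to be seconds, and the M+1 bound and the lookup of a date for a given doc id are left out.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical sketch, independent of Lucene: merges hits into time clumps as they arrive. */
final class ClusteringCollector {

    /** One clump of hits: its time span and the doc id of its earliest hit. */
    private static final class Cluster {
        final long min;
        final long max;
        final int doc;

        Cluster(long min, long max, int doc) {
            this.min = min;
            this.max = max;
            this.doc = doc;
        }
    }

    private final long n;
    // Clumps keyed by their earliest date; neighbouring clumps are always more than n apart.
    private final TreeMap<Long, Cluster> clusters = new TreeMap<>();

    ClusteringCollector(long n) {
        this.n = n;
    }

    /** Adds one hit, in any order, merging every clump that lies within n of its date. */
    void collect(int doc, long date) {
        long min = date;
        long max = date;
        int keep = doc;
        // Start at the latest clump that begins at or before date + n and walk
        // towards earlier clumps while they still reach date - n.
        Map.Entry<Long, Cluster> e = clusters.floorEntry(date + n);
        while (e != null && e.getValue().max >= date - n) {
            Cluster c = e.getValue();
            if (c.min <= min) {   // on a tie, the hit collected first stays
                min = c.min;
                keep = c.doc;
            }
            if (c.max > max) {
                max = c.max;
            }
            clusters.remove(e.getKey());
            e = clusters.lowerEntry(e.getKey());
        }
        clusters.put(min, new Cluster(min, max, keep));
    }

    /** Returns one doc id per clump (the earliest hit), ordered by date. */
    List<Integer> results() {
        List<Integer> docs = new ArrayList<>();
        for (Cluster c : clusters.values()) {
            docs.add(c.doc);
        }
        return docs;
    }
}
```

Memory use grows with the number of clumps rather than the number of hits. Adding the M+1 bound would mean dropping the last entry whenever the map grows beyond that size, with the caveat already noted above about hits outside the retained range.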


    : Date: Tue, 23 May 2006 17:38:04 -0400
    : From: Benjamin Stein <ben@shadowtv.com>
    : Reply-To: java-user@lucene.apache.org
    : To: java-user@lucene.apache.org
    : Subject: Removing search results that fall within a time range

    -Hoss



Discussion Overview
group: java-user @
categories: lucene
posted: May 23, '06 at 9:38p
active: May 24, '06 at 1:49a
posts: 4
users: 3
website: lucene.apache.org
