I have slow subsequent searches.
And if I get the cache up and running, is it persisted to disk?

/Marcus


________________________________

From: Yonik Seeley
Sent: Wed 2006-05-17 16:31
To: java-user@lucene.apache.org
Subject: Re: Sort problematics


On 5/17/06, Marcus Falck wrote:
> I noticed something quite interesting: if I search for IndexId:x
> (IndexId is unique) with a sort, it still takes a very long time,
> which it doesn't without the sort.

This will only be the case the first time you sort on a field because
a FieldCache entry is created for that field and then cached for
subsequent sorts.

-Yonik
http://incubator.apache.org/solr Solr, the open-source Lucene search server

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org
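
The first-sort cost described above can be pictured with a small sketch in plain Java. It is a toy model of a lazily filled per-field cache, not Lucene's FieldCache; the class name LazyFieldCache, its loads counter, and the loader function are all hypothetical names chosen for illustration.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

// Toy model of a per-field sort cache (hypothetical, not Lucene's FieldCache).
class LazyFieldCache {
    private final Map<String, int[]> cache = new HashMap<>();
    int loads = 0; // how many times a field had to be read from the "index"

    // Returns the per-document values for a field, loading them only on first use.
    int[] get(String field, Function<String, int[]> loader) {
        int[] values = cache.get(field);
        if (values == null) {
            values = loader.apply(field); // expensive step: visits every document once
            loads++;
            cache.put(field, values);
        }
        return values;
    }
}
```

The first call for a field pays the loading cost and later calls return the same array. Because the map lives in memory alongside the object that owns it, discarding that object discards the cached values as well.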

  • Erik Hatcher at May 18, 2006 at 10:08 am

    On May 18, 2006, at 4:52 AM, Marcus Falck wrote:
    > I have slow subsequent searches.
    > And if I get the cache up and running, is it persisted to disk?

    No, Lucene's caches are not persisted; they live only in RAM. Are you
    using a new IndexReader/IndexSearcher for your subsequent searches? If
    so, you're likely not leveraging any caches at all.

    Erik


  • Marcus Falck at May 18, 2006 at 2:23 pm
    Doesn't Solr use the same sort implementation as Lucene?


    -----Original Message-----
    From: Erik Hatcher
    Sent: May 18, 2006 14:57
    To: java-user@lucene.apache.org
    Subject: Re: SV: SV: SV: Sort problematics

    On May 18, 2006, at 7:04 AM, Marcus Falck wrote:
    > Yes I know. But the index is changed constantly.

    Then use Solr :))

    Erik

    / Marcus

    -----Original Message-----
    From: Erik Hatcher
    Sent: May 18, 2006 12:52
    To: java-user@lucene.apache.org
    Subject: Re: SV: SV: Sort problematics

    On May 18, 2006, at 6:41 AM, Marcus Falck wrote:
    > Yes Erik I'm instantiating a new IndexSearcher for every search.

    Then don't :) You only need a new IndexSearcher instance when the
    index itself has changed.
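
One way to act on that advice is to keep a single shared searcher and replace it only when the index reports a new version. The sketch below is plain Java with no Lucene dependency; SearcherHolder, the version supplier, and the opener are hypothetical stand-ins for whatever the application uses to detect index changes and to open a searcher.

```java
import java.util.function.LongSupplier;
import java.util.function.Supplier;

// Hands out one shared searcher and replaces it only when the index version moves.
class SearcherHolder<S> {
    private final LongSupplier currentVersion; // hypothetical: reports the index version
    private final Supplier<S> opener;          // hypothetical: opens a fresh searcher
    private S searcher;
    private long openedAtVersion;
    int opens = 0; // number of times a searcher was actually opened

    SearcherHolder(LongSupplier currentVersion, Supplier<S> opener) {
        this.currentVersion = currentVersion;
        this.opener = opener;
    }

    // Returns the shared searcher, reopening only if the index has changed.
    synchronized S get() {
        long version = currentVersion.getAsLong();
        if (searcher == null || version != openedAtVersion) {
            searcher = opener.get();
            openedAtVersion = version;
            opens++;
        }
        return searcher;
    }
}
```

Closing the replaced searcher once in-flight queries have finished is left out of the sketch and would need handling in a real application.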

  • Karl wettin at May 18, 2006 at 2:48 pm

    On Thu, 2006-05-18 at 16:22 +0200, Marcus Falck wrote:
    > Doesn't Solr use the same sort implementation as Lucene?

    Solr comes with more cache.

    Is it a requirement that the new data is instantly available?


  • Yonik Seeley at May 18, 2006 at 3:24 pm

    On 5/18/06, Marcus Falck wrote:
    > Doesn't Solr use the same sort implementation as Lucene?

    Yes, but Solr handles the mechanics of warming up a new searcher in
    the background to avoid those lengthy first-time hits to the
    FieldCache and norms, and it warms any configured caches based on
    previous requests.

    There is still the issue of data freshness... you don't want to open a
    new searcher too often (probably no more than once a minute).

    -Yonik
    http://incubator.apache.org/solr Solr, the open-source Lucene search server
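
The warm-then-swap idea in this message can be sketched in plain Java. This is not Solr's implementation; WarmingSwapper, the opener, and the warmer callback are hypothetical names, and the warmer stands in for running whatever queries fill the caches.

```java
import java.util.concurrent.atomic.AtomicReference;
import java.util.function.Consumer;
import java.util.function.Supplier;

// Hypothetical helper: keeps serving the old searcher while a new one warms up.
class WarmingSwapper<S> {
    private final AtomicReference<S> live = new AtomicReference<>();

    // The searcher that queries should use right now (null before the first refresh).
    S current() {
        return live.get();
    }

    // Opens a fresh searcher, warms it, and only then publishes it.
    void refresh(Supplier<S> opener, Consumer<S> warmer) {
        S fresh = opener.get();
        warmer.accept(fresh); // e.g. run one sorted query per sort field
        live.set(fresh);
    }

    // Same as refresh, but on a separate thread so live queries are not blocked.
    Thread refreshInBackground(Supplier<S> opener, Consumer<S> warmer) {
        Thread t = new Thread(() -> refresh(opener, warmer));
        t.start();
        return t;
    }
}
```

Queries that call current() during a refresh keep using the old searcher, so none of them pays the first-sort cost; the price is that newly indexed data stays invisible until warming completes.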

  • Marcus Falck at May 18, 2006 at 3:20 pm
    OK.
    I just set up a machine running Solr, and now I will index a couple of gigabytes to see the difference in performance (using a sort).

    But since my "real" index will be around 2TB in size, I don't think sorting is the right way to go. I'm pretty sure I will have to modify the ranking.

    And yes the data must be instantly available.

    /
    Marcus






  • Yonik Seeley at May 18, 2006 at 3:43 pm

    On 5/18/06, Marcus Falck wrote:
    > But since my "real" index will be around 2TB in size I don't think sorting is the right way to go? I'm pretty sure I will have to modify the ranking.

    Ranking by relevance and sorting by a field are both sorts, and they
    both use a priority queue. The differences shouldn't be that great
    after the FieldCache is populated. The biggest downside to the
    FieldCache is the memory usage, not the CPU.

    > And yes the data must be instantly available.

    For each update? If so, use a database - Lucene made different
    tradeoffs in its design.

    -Yonik
    http://incubator.apache.org/solr Solr, the open-source Lucene search server
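
The point that both orderings come down to the same priority-queue selection can be shown in plain Java. This is a sketch of the general technique, not Lucene's code; TopN is a hypothetical name, and keys[doc] stands for either a relevance score or a cached field value.

```java
import java.util.PriorityQueue;

// Selects the n best documents with a bounded min-heap.
class TopN {
    // Returns the ids of the n matching docs with the largest key, best first.
    static int[] top(int[] matchingDocs, float[] keys, int n) {
        // Min-heap on key: the head is always the weakest of the current top n.
        PriorityQueue<Integer> heap =
            new PriorityQueue<>((a, b) -> Float.compare(keys[a], keys[b]));
        for (int doc : matchingDocs) {
            heap.offer(doc);
            if (heap.size() > n) {
                heap.poll(); // drop the weakest candidate
            }
        }
        // Polling yields weakest first, so fill the result from the back.
        int[] result = new int[heap.size()];
        for (int i = result.length - 1; i >= 0; i--) {
            result[i] = heap.poll();
        }
        return result;
    }
}
```

The work per matching document is the same whichever array supplies the keys. What a field sort adds is the memory needed to hold one key per document.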

  • Marcus Falck at May 18, 2006 at 8:28 pm
    Hi,
    Where can I read more about the Lucene sort implementation?
    Is there any documentation on sorting other than the Lucene API docs?

    /
    Marcus





    ________________________________

    From: Yonik Seeley
    Sent: Thu 2006-05-18 20:39
    To: java-user@lucene.apache.org
    Subject: Re: Sort problematics


    On 5/18/06, Marcus Falck wrote:
    > I'm well aware of the trade-offs. But if you were aware of the large amounts of data that this system should be able to search, you wouldn't propose the usage of a database.

    If you have a hard requirement of instantly seeing any update, you
    can't use Lucene. That's more database-like functionality. That's
    why I asked.

    > Since I have a separate alert service for immediate alerts up and running, I may be able to make trade-offs with the data availability timings and hold the IndexSearcher open for a longer period.

    That's pretty much a requirement for using Lucene to support a decent
    query rate.

    > But still. The memory is the problem.
    > I mean, how much memory would the FieldCache take for 500 million newsletter articles? Probably a lot.
    > OK, the system is scaled out over different machines, so in reality each machine won't have 500 million docs but maybe around 100 million.

    Depends on what you are sorting by... for an int/float 100M*4 or
    800MB. Big, but possible.

    > So I'm still interested in changing the relevance.
    > Any ideas?

    Depends on what you are sorting by, and how many different ways you
    want to sort. If it's a single sort criterion, you can use index-time
    boosts. If you can sort multiple ways, avoiding the FieldCache
    probably won't help you because the time to retrieve the per-doc sort
    info via term vectors or stored fields will take too long.


    -Yonik
    http://incubator.apache.org/solr Solr, the open-source Lucene search server
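
The sizing figure in the message above can be checked with back-of-the-envelope arithmetic, assuming one fixed-width value per document and ignoring object overhead and string fields. FieldCacheEstimate is a hypothetical helper. Under that assumption, 100 million 4-byte int or float values come to about 400 MB, and the 800MB quoted would correspond to 8 bytes per document.

```java
// Rough memory estimate for one fixed-width sort value per document.
class FieldCacheEstimate {
    // Total bytes for docCount values of the given width.
    static long bytes(long docCount, int bytesPerValue) {
        return docCount * bytesPerValue;
    }

    // Converts bytes to decimal megabytes (1 MB = 1,000,000 bytes).
    static double megabytes(long bytes) {
        return bytes / 1_000_000.0;
    }

    public static void main(String[] args) {
        long docs = 100_000_000L;
        System.out.println("4-byte values: " + megabytes(bytes(docs, 4)) + " MB");
        System.out.println("8-byte values: " + megabytes(bytes(docs, 8)) + " MB");
    }
}
```

Each additional sort field adds another array of the same size, so the number of distinct sort fields matters as much as the document count.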

  • Erik Hatcher at May 18, 2006 at 8:56 pm

    On May 18, 2006, at 4:25 PM, Marcus Falck wrote:
    > Where can I read more about the Lucene sort implementation?
    > Is there any documentation on sorting other than the
    > Lucene API docs?

    Well, there is "Lucene in Action" which covers sorting in a fair bit
    of detail. I hear that book is pretty cool ;)

    Erik


  • Marcus Falck at May 18, 2006 at 8:53 pm
    ________________________________

    From: Yonik Seeley
    Sent: Thu 2006-05-18 20:39
    To: java-user@lucene.apache.org
    Subject: Re: Sort problematics


    On 5/18/06, Marcus Falck wrote:
    > > I'm well aware of the trade-offs. But if you were aware of the large amounts of data that this system should be able to search, you wouldn't propose the usage of a database.
    > If you have a hard requirement of instantly seeing any update, you
    > can't use Lucene. That's more database-like functionality. That's
    > why I asked.

    Instant in this case isn't really instant. Let's say that the MAXIMUM delay that will be accepted is 5 minutes. Since I have the alert service up and running, all users who pay for immediate alerts will get their hits from it.

    > > Since I have a separate alert service for immediate alerts up and running, I may be able to make trade-offs with the data availability timings and hold the IndexSearcher open for a longer period.
    > That's pretty much a requirement for using Lucene to support a decent
    > query rate.

    I thought I had good query rates when I instantiated the IndexSearcher for every search. In my 10GB example index I had response times under a second for a boolean query containing approximately 200 terms. But I will redesign to keep the IndexSearcher around and recreate it on a regular basis. After this redesign I suppose the search will benefit from the FieldCache (until I recreate it). And as you say, 800MB is nothing. I will have at least 4GB of RAM in each search machine without problems.

    > > But still. The memory is the problem.
    > > I mean, how much memory would the FieldCache take for 500 million newsletter articles? Probably a lot.
    > > OK, the system is scaled out over different machines, so in reality each machine won't have 500 million docs but maybe around 100 million.
    > Depends on what you are sorting by... for an int/float 100M*4 or
    > 800MB. Big, but possible.
    > > So I'm still interested in changing the relevance.
    > > Any ideas?
    > Depends on what you are sorting by, and how many different ways you
    > want to sort. If it's a single sort criterion, you can use index-time
    > boosts. If you can sort multiple ways, avoiding the FieldCache
    > probably won't help you because the time to retrieve the per-doc sort
    > info via term vectors or stored fields will take too long.

    There is, however, a problem with boosting the docs, since the boost factor is of type float. A float doesn't have the resolution needed to distinguish documents at one-second granularity.
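
The resolution concern can be checked directly in plain Java, assuming the date is represented as seconds since the Unix epoch (an assumption; the thread does not say how the date would be turned into a boost). FloatResolution is a hypothetical helper. A float has a 24-bit significand, so near a May 2006 timestamp adjacent float values are 128 seconds apart.

```java
// Shows how coarsely a 32-bit float represents epoch-second timestamps.
class FloatResolution {
    // Gap between adjacent float values near the given timestamp, in seconds.
    static float spacingAt(long epochSeconds) {
        return Math.ulp((float) epochSeconds);
    }

    // True if the two timestamps still differ after conversion to float.
    static boolean distinguishable(long a, long b) {
        return (float) a != (float) b;
    }

    public static void main(String[] args) {
        long t = 1147985536L; // 2006-05-18 20:52:16 UTC
        System.out.println("spacing near t: " + spacingAt(t) + " s");
        System.out.println("t vs t+1 distinguishable: " + distinguishable(t, t + 1));
    }
}
```

Timestamps one second apart usually collapse to the same float value at this magnitude, which is the loss of resolution described above.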

    I will illustrate my sort/ranking need with an example:

    If I use Lucene's default implementation of the TermScorer and search for

    "you" OR "her"

    the term scorer will give a higher score to documents containing both terms. This is a problem (in our application), since in this case we want the same score for documents as long as they contain one of the terms (we are dealing with newsletter monitoring for companies, and they want the hits ordered by date to get the complete overview). I rewrote the TermScorer to give me the same score, and it worked. So my question is:
    If I can modify the score at search time using the Scorer class, why does everybody talk about the Similarity class?



    /

    Marcus
  • Yonik Seeley at May 18, 2006 at 9:02 pm

    On 5/18/06, Marcus Falck wrote:

    > If I use Lucene's default implementation of the TermScorer and search for
    >
    > "you" OR "her"
    >
    > the term scorer will give a higher score to documents containing both terms. This is a problem (in our application), since in this case we want the same score for documents as long as they contain one of the terms

    If this is your problem, it has nothing to do with "sorting" (using
    Lucene terminology) but scoring. There are a number of ways:

    1) Change Similarity.coord() (read the JavaDoc for Similarity)
    2) DisjunctionMaxQuery
    3) query "you" and "her" separately and use the union of the results

    -Yonik
    http://incubator.apache.org/solr Solr, the open-source Lucene search server
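
The difference between the first two options can be seen in a toy model written in plain Java. It only loosely follows the shape of the scoring involved and is not Lucene's actual formula; DisjunctionScoring and its methods are hypothetical, and a clause score of 0 is taken to mean the clause did not match.

```java
// Toy comparison of three ways to combine per-clause scores for an OR query.
class DisjunctionScoring {
    // Sum of clause scores scaled by the fraction of clauses matched,
    // which rewards documents that match more terms.
    static float sumWithCoord(float[] clauseScores) {
        float sum = 0f;
        int matched = 0;
        for (float s : clauseScores) {
            if (s > 0f) {
                sum += s;
                matched++;
            }
        }
        return sum * matched / clauseScores.length;
    }

    // Plain sum, i.e. the coordination factor fixed at 1.
    static float sumNoCoord(float[] clauseScores) {
        float sum = 0f;
        for (float s : clauseScores) {
            sum += s;
        }
        return sum;
    }

    // Best single clause wins; additional matching clauses add nothing.
    static float max(float[] clauseScores) {
        float best = 0f;
        for (float s : clauseScores) {
            best = Math.max(best, s);
        }
        return best;
    }
}
```

In this model, neutralising coordination removes the extra multiplier but the sum still favours documents that match both terms. Taking the maximum gives a one-term match and a two-term match the same score when the clause scores are equal.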

  • Günther Starnberger at May 18, 2006 at 9:21 pm
    On Thu, May 18, 2006 at 10:53:23PM +0200, Marcus Falck wrote:

    Hello,
    > The term scorer will give higher score on documents containing both
    > terms. This is a problem (in our application) since in this case want
    > the same score on documents as long as they contain 1 of the terms
    > (since we are dealing with newsletter observation for companies they
    > want to get the hits ordered by date to get the complete overview). I
    > tested to rewrite the TermScorer to give me the same score with
    > success. So my question is.

    What exactly do you want to achieve with your application?

    You speak of "immediate alerts". I understand this as: your users
    specify keywords or queries, and when you receive a new document that
    matches a query you alert the user.

    Is this what you want to do? If so, I don't think that Lucene is useful
    for this kind of real-time query. Instead of using an inverted index
    it would make more sense to use a normal index which contains the
    terms you search for. When you receive a new document, look up each
    term of the document in that index. It _might_ be possible to
    do this with Lucene by storing the search terms as documents and using
    the documents which you receive as queries, but I guess it isn't
    that trivial.

    If you need a combination of traditional search and real-time alerts, a
    hybrid solution may make sense. But using Lucene for real-time search
    isn't a good idea (at least IMO).

    bye,
    /gst
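
The lookup described in this message, an index of subscribed terms that is probed with each term of an incoming document, can be sketched in plain Java. AlertIndex and its method names are hypothetical, subscriptions are limited to single keywords, and tokenising on non-word characters is a simplification chosen for the example.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Maps each subscribed term to the users who want alerts for it.
class AlertIndex {
    private final Map<String, Set<String>> subscribersByTerm = new HashMap<>();

    // Registers a user's interest in a single keyword.
    void subscribe(String user, String term) {
        subscribersByTerm
            .computeIfAbsent(term.toLowerCase(), t -> new TreeSet<>())
            .add(user);
    }

    // Looks up every term of an incoming document and collects the users to alert.
    Set<String> match(String documentText) {
        Set<String> hits = new TreeSet<>();
        for (String token : documentText.toLowerCase().split("\\W+")) {
            Set<String> users = subscribersByTerm.get(token);
            if (users != null) {
                hits.addAll(users);
            }
        }
        return hits;
    }
}
```

The cost of handling one document depends on the number of terms in that document and the users found for them, not on how many documents have already been seen, which is why this shape suits immediate alerts. Boolean or phrase subscriptions would need more than a term lookup.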
  • Erik Hatcher at May 19, 2006 at 1:14 am

    Actually there is contrib/memory's MemoryIndex that is specifically
    designed for this type of single document high performance querying.

