FAQ
Hi,

Is there an easy way to find out the number of hits per document for a
Query, rather than just for a Term?

Let's say, for example, I have a document like this:

"here is cats near dogs and here is cats a long long way from dogs"

and I use a SpanNearQuery to find "cats" near "dogs" with a slop of 1 -
I need to be able to find out that there was 1 hit, even though there
are 2 occurrences of "cats" and 2 of "dogs" - there is still only 1 hit
that matches my Query.

Is this possible?

Thanks,
JB.




---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


  • Grant Ingersoll at Jun 10, 2008 at 1:05 pm
    A SpanQuery is just a Query, so the traditional way of Querying still
    applies, i.e. you get back a list of matching documents. Beyond that,
    if you just want to operate on the spans, just keep track of how often
    the doc() method changes.
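
    The suggestion above amounts to walking the matches and tallying them by
    document. The block below is only a minimal, Lucene-free sketch of that idea,
    applied to the example document from the question. The class NearHitCounter,
    its methods, and the simplified notion of slop (the number of tokens between
    "cats" and a following "dogs") are all assumptions made for illustration;
    they are not Lucene's Spans API and may differ from its exact matching rules.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical, simplified model of "first near second with slop N".
// It does not use Lucene; it only illustrates counting hits per document.
class NearHitCounter {

    // Counts occurrences of `first` that are followed by `second` with at
    // most `slop` other tokens in between. Each occurrence of `first` is
    // counted at most once.
    static int countNearHits(String[] tokens, String first, String second, int slop) {
        int hits = 0;
        for (int i = 0; i < tokens.length; i++) {
            if (!tokens[i].equals(first)) {
                continue;
            }
            for (int j = i + 1; j < tokens.length && j - i - 1 <= slop; j++) {
                if (tokens[j].equals(second)) {
                    hits++;
                    break;
                }
            }
        }
        return hits;
    }

    // Tallies hits per document; the array index plays the role of the doc id.
    static int[] hitsPerDoc(List<String> docs, String first, String second, int slop) {
        int[] hits = new int[docs.size()];
        for (int doc = 0; doc < docs.size(); doc++) {
            hits[doc] = countNearHits(docs.get(doc).split(" "), first, second, slop);
        }
        return hits;
    }

    public static void main(String[] args) {
        List<String> docs = Arrays.asList(
            "here is cats near dogs and here is cats a long long way from dogs",
            "dogs and cats");
        // Prints one hit count per document.
        System.out.println(Arrays.toString(hitsPerDoc(docs, "cats", "dogs", 1)));
    }
}
```

    With a real index, the same tally would be kept while iterating the spans
    and reading each span's document number, as described above.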

    HTH,
    Grant
    --------------------------
    Grant Ingersoll
    http://www.lucidimagination.com

    Lucene Helpful Hints:
    http://wiki.apache.org/lucene-java/BasicsOfPerformance
    http://wiki.apache.org/lucene-java/LuceneFAQ








  • Lutan at Jun 10, 2008 at 1:12 pm
    I have recently done some tests on Lucene, and I do not know whether the
    test results are normal.

    Hardware environment: Intel(R) Xeon(R) CPU 5110 @ 1.60GHz, 4GB RAM
    Software environment: CentOS 4.6 + Sun JDK 1.5 + JBoss + Lucene 2.3.2 +
    je-analysis (a Chinese analyzer)
    There are 10 million+ documents, which total about 3GB.

    Test steps:
    1. Run a single searcher.jsp in JBoss (tuned, using 1GB RAM).
    2. Use LoadRunner to test:
       simulating 10 concurrent users, the TPS (transactions per second) is about 10;
       simulating 50 concurrent users, the TPS is about 8;
       simulating 100 concurrent users, the TPS is about 2.

    The JSP was very simple, with the index on the local file system:

    ------------------------------------------------------------
    <body>
    <center>
    <form action="lucene.jsp" method="post" name="form1">
      <input type="text" value="" name="keyword2"/>
      <input type="submit" value="searcher" onclick="SUB()"/>
      <input type="reset" value="exit"/>
    </form>
    </center>
    <hr>
    <%
    if (request.getParameter("keyword2") != null
        && !"".equals(request.getParameter("keyword2"))) {
      String dir = "/usr/local/index";
      String key = "name";
      String word = new String(request.getParameter("keyword2"), "utf-8");
      Searcher searcher = null;
      searcher = new IndexSearcher(FSDirectory.getDirectory(dir, false));
      Analyzer myAnalyzer = new jeasy.analysis.MMAnalyzer();
      QueryParser queryParser = new QueryParser(key, myAnalyzer);
      Query query = queryParser.parse(word);
      Hits hits = null;
      long startTime = System.nanoTime();
      hits = searcher.search(query);
      long estimatedTime = System.nanoTime() - startTime;
      BigDecimal bb = new BigDecimal(estimatedTime);
      BigDecimal ee = new BigDecimal(1000000000);
      System.out.println("Key word: " + word + " Hits:" + hits.length()
          + " Cost time: " + bb.divide(ee) + "/s");
      searcher.close();
    }
    out.print("ABC");
    %>
    </body>
    ----------------------search.jsp----------------------------

    I also tried using a singleton IndexSearcher, but it does not seem to help:

    ------------------------------------------------------------
    public IndexSearcher getIndexSearcher() throws IOException {
      if (this.indexSearcher == null) {
        return new IndexSearcher(FSDirectory.getDirectory(folder, false));
      } else {
        IndexReader ir = indexSearcher.getIndexReader();
        if (!ir.isCurrent()) {
          this.indexSearcher.close();
          this.indexSearcher = new IndexSearcher(FSDirectory.getDirectory(folder, false));
          ir = indexSearcher.getIndexReader();
          if (ir.hasDeletions()) {
            if (this.indexWriter != null) {
              this.indexWriter.optimize();
            }
          }
        }
        return this.indexSearcher;
      }
    }
    ----------------GetsingletonIndexsearcher.java--------------

    The same code in a standalone application performs about one search per
    0.5s on average. So how do I improve the searching performance in a
    concurrent environment? Can this hardware (Intel(R) Xeon(R) CPU 5110 @
    1.60GHz, 4GB RAM) give me 50+ TPS?
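
    For readers who want to reproduce this kind of measurement without
    LoadRunner, the block below is a rough sketch of a throughput probe. The
    class ThroughputProbe, its methods, and the no-op request are hypothetical
    stand-ins; a real test would issue HTTP requests or searches instead. It
    assumes a modern JDK with lambdas, not the JDK 1.5 used in the test above.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical throughput probe: N simulated users, each issuing M requests.
class ThroughputProbe {

    // Runs `users` threads, each calling `request` `requestsPerUser` times,
    // and returns the number of completed calls.
    static long run(int users, int requestsPerUser, Runnable request) {
        ExecutorService pool = Executors.newFixedThreadPool(users);
        AtomicLong completed = new AtomicLong();
        for (int u = 0; u < users; u++) {
            pool.execute(() -> {
                for (int r = 0; r < requestsPerUser; r++) {
                    request.run();
                    completed.incrementAndGet();
                }
            });
        }
        pool.shutdown();
        try {
            // Wait for all simulated users to finish.
            pool.awaitTermination(1, TimeUnit.HOURS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return completed.get();
    }

    // Transactions per second for a given count and elapsed time.
    static double tps(long transactions, long elapsedNanos) {
        return transactions / (elapsedNanos / 1_000_000_000.0);
    }

    public static void main(String[] args) {
        long start = System.nanoTime();
        long done = run(10, 100, () -> { /* stand-in for one search request */ });
        long elapsed = System.nanoTime() - start;
        System.out.printf("%d requests, %.1f TPS%n", done, tps(done, elapsed));
    }
}
```

    Running the probe for several minutes, as in the test above, also shows
    how throughput changes while caches warm up.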
  • Toke Eskildsen at Jun 10, 2008 at 2:38 pm

    On Tue, 2008-06-10 at 21:11 +0800, lutan wrote:
    [A lot of text with code and no newlines, making it very hard to read]
    In your test you're reusing the searcher. For each search your program
    performs, you will see faster response times, until the searcher is
    fully warmed.

    In your production system, you re-open your searcher every time and do
    not have the benefit of a warmed searcher.

    So yes, Singleton searcher helps, as opposed to opening a searcher for
    every search. Try making a test where the only thing you do is open a
    searcher 100 times and you will see that it takes a non-trivial amount
    of time.
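
    A minimal sketch of the "open once, reuse for every search" pattern follows.
    SharedSearcherHolder and SharedSearcherDemo are hypothetical names, and a
    plain Object stands in for the expensive searcher so that the example runs
    without Lucene. Note that in the getIndexSearcher() method posted earlier,
    the null branch appears to return a new searcher without storing it in the
    field, so that version would open a searcher on every call.

```java
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Supplier;

// Hypothetical holder that opens an expensive resource once and reuses it.
class SharedSearcherHolder<S> {
    private final Supplier<S> opener;
    private S searcher;

    SharedSearcherHolder(Supplier<S> opener) {
        this.opener = opener;
    }

    // Opens the resource on the first call and stores it, so later calls
    // return the same instance. Synchronized so concurrent requests do not
    // each open their own copy.
    synchronized S get() {
        if (searcher == null) {
            searcher = opener.get();
        }
        return searcher;
    }
}

class SharedSearcherDemo {
    // Returns how many times the resource was opened for `requests` lookups.
    static int opensFor(int requests) {
        AtomicInteger opens = new AtomicInteger();
        SharedSearcherHolder<Object> holder = new SharedSearcherHolder<>(() -> {
            opens.incrementAndGet();
            return new Object(); // stand-in for an expensive searcher
        });
        for (int i = 0; i < requests; i++) {
            holder.get();
        }
        return opens.get();
    }

    public static void main(String[] args) {
        System.out.println("opens for 100 requests: " + opensFor(100));
    }
}
```

    A production version would also need a policy for replacing the shared
    instance when the index changes, which this sketch leaves out.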



  • Lutan at Jun 10, 2008 at 4:18 pm
    Thanks for the reply!

    In my test case, I ran LoadRunner for just 5 minutes, and the response rate
    grew slowly. The TPS (transactions per second) finally seemed to stop at 10.
    I will run the test again for a longer time.
    In addition, does Lucene have a bottleneck related to the number of
    documents or the index size?
  • Toke Eskildsen at Jun 11, 2008 at 7:24 am

    On Wed, 2008-06-11 at 00:17 +0800, lutan wrote:
    > In my test case, I ran LoadRunner for just 5 minutes, and the response rate
    > grew slowly. The TPS (transactions per second) finally seemed to stop at 10.

    That's without reusing the searcher, right? In that case the increased
    rate must be attributed to the disk cache being warmed. Please try and
    test again with the searcher being reused.

    > In addition, does Lucene have a bottleneck related to the number of
    > documents or the index size?

    Not to my knowledge. Search time and RAM consumption go up, of course,
    but I'm not aware of any special point where things start to go bad at
    an increased rate.

    > Can this hardware (Intel(R) Xeon(R) CPU 5110 @ 1.60GHz, 4GB RAM) give
    > me 50+ TPS?

    With an index of 10M/3GB? It doesn't sound unrealistic after warm-up.
    How much RAM is available for disk-cache, when the machine is running?


  • Lutan at Jun 11, 2008 at 10:57 am
    Thanks for your reply!

    > That's without reusing the searcher, right? In that case the increased
    > rate must be attributed to the disk cache being warmed. Please try and
    > test again with the searcher being reused.

    Yes, I have tested again in the same environment, but using a singleton
    IndexSearcher, and the performance has increased. With 100 concurrent users
    requesting different keywords, I get 60 TPS (2 TPS before). Now the
    bottleneck seems to be the CPU, with CPU usage approaching 100%. RAM
    (70MB used on average) and hard disk usage are normal.

    > Not to my knowledge. Search time and RAM consumption goes up, of course,
    > but I'm not aware of any special point where things start to go bad at
    > an increased rate.

    Can I assume that as long as I have more RAM, I will get good performance?

    > With an index of 10M/3GB? It doesn't sound unrealistic after warm-up.
    > How much RAM is available for disk-cache, when the machine is running?

    I don't clearly understand what "for disk-cache" means. Could you please
    explain it again? Thanks a lot! (Doesn't it cache in RAM?)
    Does warm-up == cache?
    How many docs will Lucene cache by default, and can I control the cache size?

    I am new to Lucene, so maybe my questions do not look professional.
    Forgive me.
  • Toke Eskildsen at Jun 13, 2008 at 7:03 am

    On Wed, 2008-06-11 at 18:56 +0800, lutan wrote:
    > Yes, I have tested again in the same environment, but using a singleton
    > IndexSearcher, and the performance has increased. With 100 concurrent users
    > requesting different keywords, I get 60 TPS (2 TPS before). Now the
    > bottleneck seems to be the CPU, with CPU usage approaching 100%. RAM
    > (70MB used on average) and hard disk usage are normal.

    It sounds like you have found the solution to your immediate problem.
    Great.

    > Can I assume that as long as I have more RAM, I will get good performance?

    Depends on your index size (in bytes). When your index grows, less and
    less of it can fit in the disk-cache and more time will be required for
    proper warm-up. But the change will happen gradually, so you'll only be
    surprised if you suddenly double your index size or more.

    > I don't clearly understand what "for disk-cache" means. Could you please
    > explain it again? Thanks a lot! (Doesn't it cache in RAM?)
    > Does warm-up == cache?

    There are (at least) two important memory mechanisms to consider.
    My apologies if some of this is basic knowledge to you:

    1) Disk-cache.
    In general, the free RAM on your Linux-system is used for disk-cache.
    With an index-size of 3GB and (just a guess) 1 GB free RAM, the
    operating system is able to cache 1/3 or less of your index. If you open
    the same index several times in a row, the disk-cache will be warmed to
    the relevant parts of your index, so that you're not even hitting the
    disk after a while. At least not for opening. This is the effect you
    observed with your non-singleton based test, where the speed increased
    slowly up to a not-so-high level.

    2) Lucene internal structures.
    I don't know much about this, so I hope somebody will correct me if I
    make mistakes: Lucene has some internal structures that are initialized
    when searches are performed. Depending on setup, this initialization can
    be quite heavy (custom search for example). Performing warm-up, such as
    searching with previously logged queries, will initialize these
    structures before the real queries are received. This is the effect you
    observed with your singleton searcher.

    1 & 2 can be seen in combination, as the initialization of the internal
    structures in Lucene requires a fair amount of seeks in the index data.
    If there's nothing in the disk-cache and a conventional platter-based
    harddisk is used, it takes some time. If the disk-cache is warmed from
    previous use or a solid state drive setup is used, it is much faster.
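
    The warm-up step described in 2) can be sketched as replaying previously
    logged queries before real traffic arrives. SearcherWarmer and its warm()
    method are hypothetical names, and the search itself is passed in as a
    callback so the example runs without Lucene; in a real setup the callback
    would run the query against the shared searcher.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Consumer;

// Hypothetical warm-up helper: replays logged queries through a search callback.
class SearcherWarmer {

    // Runs every logged query once and returns how many completed.
    static int warm(List<String> loggedQueries, Consumer<String> search) {
        int completed = 0;
        for (String query : loggedQueries) {
            try {
                search.accept(query);
                completed++;
            } catch (RuntimeException e) {
                // A failing warm-up query should not block start-up; skip it.
            }
        }
        return completed;
    }

    public static void main(String[] args) {
        List<String> seen = new ArrayList<>();
        // The callback only records the query here; a real one would search.
        int n = warm(Arrays.asList("cats", "dogs", "cats near dogs"), seen::add);
        System.out.println(n + " warm-up queries replayed: " + seen);
    }
}
```

    Replaying queries also pulls the touched parts of the index into the
    disk-cache, which is the combined effect of 1) and 2) described above.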

    > How many docs will Lucene cache by default, and can I control the
    > cache size?

    I don't know. Maybe someone else will chime in?


  • Lutan at Jun 13, 2008 at 10:44 am
    Many thanks to Toke Eskildsen for his attention to my questions.

    > It sounds like you have found the solution to your immediate problem.
    > Great.

    The performance increase is thanks to your suggestion.
    Today I ran another test, using RemoteSearchable (code like the example
    supplied in "Lucene in Action").
    The application runs in these steps:
    1. A customer sends a keyword request to the web server (JBoss: 192.168.0.1).
    2. JBoss calls the RMI server (192.168.0.2), which holds the index files.
    The rest of the test environment is the same as before.

    The result:
    LoadRunner: 300 concurrent users (I found that each user gets one TCP/IP
    connection from the web server to the RMI server). The TPS reached 180+,
    and the average web response time is about 2 seconds. Both the web server
    and the RMI server show normal usage: CPU at 50%, RAM not full.

    The performance has almost tripled!
    It's amazing to me :)
    I expected the RMI approach to have lower performance (because of the
    expensive network use), so the result really puzzles me :(


    Regarding the disk-cache and warm-up: I understand it now from your reply,
    thanks a lot.
  • John Byrne at Jun 10, 2008 at 1:29 pm
    Hi,

    I could do it that way, but counting the spans per document is specific
    to SpanQuerys. I would still have to count hits for TermQuerys
    separately. I was looking for a generic way to count hits for any
    instance of Query within a document.

    To put it another way, the ability to find the Term frequency in a
    single document seems incomplete, since a Term does not equate to a hit.
    For instance, sticking with my previous example, if my document
    contained a thousand occurrences of "cats" but only one of them is near
    "dogs", then the frequency of the Term "cats" in that document is
    irrelevant to me.

    In general, my queries will consist of a BooleanQuery containing any
    number of sub-queries of any implementation - what I actually need to
    know is how many hits there are for that BooleanQuery query in each
    document. Maybe I will expand the BooleanQuery into all its sub-queries
    recursively, and then handle them by type - counting spans per document
    for SpanQuerys and using the Term frequency for TermQuerys. I was just
    hoping there would be an existing (and fast) way to do this.
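
    The plan in the paragraph above (expand the boolean query recursively, then
    count by type) could look roughly like the sketch below. QueryNode,
    TermNode, NearNode, BoolNode, and QueryHitCounter are hypothetical stand-ins
    and are NOT Lucene classes. Summing the counts of the clauses is an
    assumption; the thread does not say how sub-query hits should be combined.
    The near-match rule is the same simplified one used for illustration
    elsewhere: the second term follows the first with at most `slop` tokens in
    between.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-ins for query types; these are NOT Lucene classes.
interface QueryNode {}

class TermNode implements QueryNode {
    final String term;
    TermNode(String term) { this.term = term; }
}

class NearNode implements QueryNode {
    final String first;
    final String second;
    final int slop;
    NearNode(String first, String second, int slop) {
        this.first = first;
        this.second = second;
        this.slop = slop;
    }
}

class BoolNode implements QueryNode {
    final List<QueryNode> clauses;
    BoolNode(QueryNode... clauses) { this.clauses = Arrays.asList(clauses); }
}

class QueryHitCounter {
    // Counts hits for one document, dispatching on the query type.
    static int countHits(QueryNode q, String[] tokens) {
        if (q instanceof BoolNode) {
            // Assumption: a boolean query's hits are the sum over its clauses.
            int total = 0;
            for (QueryNode clause : ((BoolNode) q).clauses) {
                total += countHits(clause, tokens);
            }
            return total;
        }
        if (q instanceof TermNode) {
            // Term frequency within the document.
            int freq = 0;
            for (String t : tokens) {
                if (t.equals(((TermNode) q).term)) freq++;
            }
            return freq;
        }
        if (q instanceof NearNode) {
            // Simplified ordered near-match count.
            NearNode n = (NearNode) q;
            int hits = 0;
            for (int i = 0; i < tokens.length; i++) {
                if (!tokens[i].equals(n.first)) continue;
                for (int j = i + 1; j < tokens.length && j - i - 1 <= n.slop; j++) {
                    if (tokens[j].equals(n.second)) { hits++; break; }
                }
            }
            return hits;
        }
        throw new IllegalArgumentException("unsupported query type: " + q.getClass().getName());
    }

    public static void main(String[] args) {
        String[] doc = "here is cats near dogs and here is cats a long long way from dogs".split(" ");
        QueryNode q = new BoolNode(new TermNode("here"), new NearNode("cats", "dogs", 1));
        System.out.println(countHits(q, doc));
    }
}
```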

    Thanks,
    John

  • Spencer Tickner at Jun 10, 2008 at 8:43 pm
    Hi John,

    Sorry I don't have a solution for you but I'm trying to do the same
    thing. I would love to hear from you if you have any success with
    this.

    Cheers,

    Spencer
    spencertickner@gmail.com
  • Chris Hostetter at Jun 17, 2008 at 10:05 pm
    : > I could do it that way, but couting the spans per document is specific to
    : > SpanQuerys. I would still have to count hits for TermQuerys separately. I
    : > was looking for a generic way to count hits for any instance of Query within
    : > a document.

    The original Query, Weight, and Scorer APIs provided no mechanism for
    doing this -- this is one of the reasons why the SpanQuery API exists: to
    model the types of queries that (can) collect this type of information as
    they score documents. Non-Span based queries typically have no idea about
    this type of information (which typically allows them to be faster).



    -Hoss



Discussion Overview
group: java-user
category: lucene
posted: Jun 9, '08 at 3:21p
active: Jun 17, '08 at 10:05p
posts: 12
users: 6
website: lucene.apache.org
