Hi all,
I have a question about how to combine two scores when sorting search
results.
For example, I have 10 million pages in a Lucene index, and I know their
PageRank scores. When I run a query, every returned document has a
Lucene score, call it R (the relevance score), and a PageRank score,
call it P. I want to sort the search results by the value "P+R". If I
store the PageRank score in the index, fetch it at search time, compute
P+R, and then sort, it is too slow: in my system, when a search hits
500,000 results, the sort can take about 20s.
Sorry for my poor English. Does anyone have a good idea?

Best
Jarvis


  • Ian Lea at May 28, 2008 at 10:12 am
    Hi


    Maybe you could use the pagerank score, possibly modified, as document
    boost at indexing time. From the javadocs for
    Document.setBoost(boost)

    "Sets a boost factor for hits on any field of this document. This
    value will be multiplied into the score of all hits on this document"

    so this will give you P * R rather than P + R. It should be quick, though.
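
To see why the difference between P * R and P + R can matter, here is a minimal plain-Java sketch (no Lucene involved) that ranks three made-up documents both ways. The class name, method names, and score values are all hypothetical, chosen only to illustrate that a multiplicative boost can order results differently from a sum.

```java
import java.util.Arrays;

final class BoostVersusSum {
    /** Returns document indexes ordered by descending key value. */
    static int[] rankDescending(float[] key) {
        Integer[] order = new Integer[key.length];
        for (int i = 0; i < order.length; i++) {
            order[i] = i;
        }
        Arrays.sort(order, (a, b) -> Float.compare(key[b], key[a]));
        int[] result = new int[order.length];
        for (int i = 0; i < result.length; i++) {
            result[i] = order[i];
        }
        return result;
    }

    /** Ranking a document boost would give: relevance times PageRank. */
    static int[] rankByProduct(float[] relevance, float[] pageRank) {
        float[] key = new float[relevance.length];
        for (int i = 0; i < key.length; i++) {
            key[i] = relevance[i] * pageRank[i];
        }
        return rankDescending(key);
    }

    /** Ranking the original question asks for: relevance plus PageRank. */
    static int[] rankBySum(float[] relevance, float[] pageRank) {
        float[] key = new float[relevance.length];
        for (int i = 0; i < key.length; i++) {
            key[i] = relevance[i] + pageRank[i];
        }
        return rankDescending(key);
    }

    public static void main(String[] args) {
        float[] relevance = {0.9f, 0.5f, 0.2f}; // hypothetical R values
        float[] pageRank = {0.1f, 0.4f, 0.9f};  // hypothetical P values
        System.out.println("by P * R: " + Arrays.toString(rankByProduct(relevance, pageRank)));
        System.out.println("by P + R: " + Arrays.toString(rankBySum(relevance, pageRank)));
    }
}
```

With these sample values the two orderings differ, so a document boost alone will not reproduce a sort by the sum.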


    --
    Ian.

  • 过佳 at May 28, 2008 at 11:48 am
    Thanks Ian, but does this mean that I must reindex these pages whenever
    the PageRank score changes?

  • Ian Lea at May 28, 2008 at 1:04 pm
    Yes. But you'd have to do that anyway if you are storing pagerank in the index.

    One point on your 20s response time for sorting - is that for the
    first sort or subsequent ones?
    I believe that the first one will usually be substantially slower.
    But sorting is always likely to be slower than not sorting.
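
As far as I know, the reason is that Lucene loads the sort field's values for every document into an in-memory cache the first time a sort uses that field on a given reader, and later sorts reuse it. The plain-Java sketch below only models that load-once behaviour; `LazyFieldValues`, its loader function, and the field name are hypothetical and not part of Lucene.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

final class LazyFieldValues {
    private final Map<String, float[]> cache = new HashMap<>();
    private final Function<String, float[]> loader;
    private int loads = 0;

    LazyFieldValues(Function<String, float[]> loader) {
        this.loader = loader;
    }

    /** Returns the per-document values for a field, loading them only once. */
    float[] get(String field) {
        float[] values = cache.get(field);
        if (values == null) {
            values = loader.apply(field); // the expensive step: touches every document
            cache.put(field, values);
            loads++;
        }
        return values;
    }

    /** Number of times the expensive loader has actually run. */
    int loadCount() {
        return loads;
    }

    public static void main(String[] args) {
        // Hypothetical loader standing in for reading values out of the index.
        LazyFieldValues fields = new LazyFieldValues(name -> new float[] {0.1f, 0.9f, 0.5f});
        fields.get("pagerank"); // first sort: pays the loading cost
        fields.get("pagerank"); // later sorts: served from the cache
        System.out.println("loads: " + fields.loadCount());
    }
}
```

If that is what is happening, timing the second and third searches separately from the first would show how much of the 20s is one-time warm-up.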


    --
    Ian.

  • 过佳 at May 28, 2008 at 1:46 pm
    I don't think this is suitable for my system: the number of pages is very
    large, so reindexing would take too much time.

  • Glen Newton at May 28, 2008 at 2:52 pm
    You should consider keeping the PageRank (and any other more dynamic
    data) in a separate index (with the documents in the same order as your
    bigger, more static index) and then use a ParallelReader on both of
    them. See:
    http://lucene.apache.org/java/2_1_0/api/org/apache/lucene/index/ParallelReader.html

    -Glen

  • 过佳 at May 29, 2008 at 12:29 pm
    Thanks Glen, we have tried it, but the bottleneck is fetching each document
    (indexReader.document(num)), so it is not efficient enough.
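
One way around the per-hit document fetch, sketched below under my own assumptions: keep the PageRank values in a plain float array indexed by document number, loaded once, and keep only the best k hits in a small heap instead of sorting all 500,000. This is plain Java rather than Lucene code; `TopKBySum`, `Hit`, and the sample values are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.PriorityQueue;

final class TopKBySum {
    /** One search hit with its combined P + R score. */
    static final class Hit {
        final int doc;
        final float score;

        Hit(int doc, float score) {
            this.doc = doc;
            this.score = score;
        }
    }

    /**
     * Returns the k best hits by relevance + PageRank, best first.
     * hitDocs[i] is a document number, hitScores[i] is its relevance score,
     * and pageRank is indexed by document number.
     */
    static List<Hit> topK(int[] hitDocs, float[] hitScores, float[] pageRank, int k) {
        if (k < 1) {
            throw new IllegalArgumentException("k must be at least 1");
        }
        // Min-heap: the weakest of the current top k sits at the head.
        PriorityQueue<Hit> heap =
                new PriorityQueue<>(k, (a, b) -> Float.compare(a.score, b.score));
        for (int i = 0; i < hitDocs.length; i++) {
            // Array lookup instead of loading a stored document per hit.
            float combined = hitScores[i] + pageRank[hitDocs[i]];
            if (heap.size() < k) {
                heap.add(new Hit(hitDocs[i], combined));
            } else if (combined > heap.peek().score) {
                heap.poll();
                heap.add(new Hit(hitDocs[i], combined));
            }
        }
        List<Hit> result = new ArrayList<>(heap);
        result.sort((a, b) -> Float.compare(b.score, a.score));
        return result;
    }

    public static void main(String[] args) {
        float[] pageRank = {0.1f, 0.9f, 0.5f, 0.3f, 0.7f}; // one entry per document
        int[] hitDocs = {0, 1, 3, 4};                      // documents matching the query
        float[] hitScores = {0.8f, 0.2f, 0.9f, 0.1f};      // their relevance scores
        for (Hit hit : topK(hitDocs, hitScores, pageRank, 2)) {
            System.out.println("doc " + hit.doc + " score " + hit.score);
        }
    }
}
```

The heap keeps the work per hit small and bounded by k, which matters when only the first page of results is ever displayed.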

  • Cam Bazz at May 29, 2008 at 2:44 pm
    Hello,

    A little off topic, but how did you obtain the PageRank for each page? Did
    you calculate it yourself, or did you obtain it some other way while
    fetching a specific site?

    Best.
  • Doron Cohen at Jun 2, 2008 at 5:24 am
    Hi Jarvis,

    Check CustomScoreQuery in
    http://lucene.apache.org/java/2_3_2/api/core/org/apache/lucene/search/function/package-summary.html

    Probably something like this:
    - implement a ValueSource on top of the PageRank values,
    - create a ValueSourceQuery on top of it,
    - create a CustomScoreQuery on top of the original query and the
    ValueSourceQuery.
    Note that by default, CustomScoreQuery multiplies the scores, but you can
    override this.
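
The "override this" step might look roughly like the following. This is a plain-Java stand-in, not the Lucene class itself: `ProductScorer` and `SumScorer` are hypothetical names, and the `customScore(doc, subQueryScore, valSrcScore)` signature is my recollection of the method on CustomScoreQuery in the 2.3 javadocs, so please check it against the link above.

```java
final class CombineScores {
    /** Stand-in for the default behaviour: multiply the two scores. */
    static class ProductScorer {
        float customScore(int doc, float subQueryScore, float valSrcScore) {
            return subQueryScore * valSrcScore;
        }
    }

    /** Overrides the combination so the final score is R + P instead of R * P. */
    static class SumScorer extends ProductScorer {
        @Override
        float customScore(int doc, float subQueryScore, float valSrcScore) {
            return subQueryScore + valSrcScore;
        }
    }

    public static void main(String[] args) {
        float relevance = 0.5f;  // hypothetical R for one document
        float pageRank = 0.25f;  // hypothetical P for the same document
        System.out.println("product: " + new ProductScorer().customScore(0, relevance, pageRank));
        System.out.println("sum:     " + new SumScorer().customScore(0, relevance, pageRank));
    }
}
```

Because the combination happens while scoring, the hits come back already ordered by the combined value and no separate sort pass is needed.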

    Doron

Discussion Overview
group: java-user @
category: lucene
posted: May 28, '08 at 10:03a
active: Jun 2, '08 at 5:24a
posts: 9
users: 5
website: lucene.apache.org
