FAQ
Hi,
Merry Christmas!!

In case of Boolean query like 'sql AND server' .
I am using parser to get correct document containing both sql and server. Inside for loop in below code I get correct documented and to get frequency I need to sum frequency of 'sql' and 'server' individually with the help of termDocs.read().
As I am searching through millions of document. So, to calculate frequency it takes about 160 Second for 80k document.

Is there any way to get frequency of 'Boolean query' directly without manipulation. As it takes lots of time. In case of single term and phrase query, I got frequency for 80k document within 10 Seconds.

QueryParser parser = new QueryParser(Version.LUCENE_CURRENT, field, analyzer);
parser.setDefaultOperator(QueryParser.AND_OPERATOR);
Query query = parser.parse("sql AND server");
TopDocs docs = searcher.search(query, null, n);
TermDocs termDocs = reader.termDocs();
termDocs.seek(new Term(field, query.toString(field).split(" ")[0].toLowerCase()));
int[] ints = new int[docs.totalHits];
int[] ints1 = new int[docs.totalHits];
termDocs.read(ints, ints1);
List lsts = Arrays.asList(ArrayUtils.toObject(ints));
List lsts1 = Arrays.asList(ArrayUtils.toObject(ints1));

termDocs.seek(new Term(field, query.toString(field).split(" ")[1].toLowerCase()));
int[] inta = new int[docs.totalHits];
int[] inta1 = new int[docs.totalHits];
termDocs.read(inta, inta1);
List lsta = Arrays.asList(ArrayUtils.toObject(inta));
List lsta1 = Arrays.asList(ArrayUtils.toObject(inta1));

int totalFreq = 0;
int docId = -1;
String a = null;
int contId = 0;
String path=null;
for (int i = 0; i < docs2.scoreDocs.length; i++) {
docId = docs2.scoreDocs[i].doc;
path=reader.document(docId).get("path");
a = path.substring(path.lastIndexOf("\\") + 1, path.lastIndexOf("."));
try {
if(a.indexOf("e")>-1){
contId = Integer.parseInt(a.substring(0, a.length()-1));
}else{
contId = Integer.parseInt(a);
}
} catch (Exception e) {
// e.printStackTrace();
}

if ((lsts.indexOf(docId) > -1 && lsta.indexOf(docId) > -1) || (lsta.indexOf(docId) > -1) && lsts.indexOf(docId) > -1) {
totalFreq = (Integer) lsts1.get(lsts.indexOf(docId)) + (Integer) lsta1.get(lsta.indexOf(docId));
} else if (lsts.indexOf(docId) > -1) {
totalFreq = (Integer) lsts1.get(lsts.indexOf(docId));
} else if (lsta.indexOf(docId) > -1) {
totalFreq = (Integer) lsta1.get(lsta.indexOf(docId));
}

w.write(contId+"\t"+ID+"\t"+totalFreq+"\t"+reader.document(docId).get("path")+"\n");
}


Thanks & Regards,
Ranjit Kumar
Associate Software Engineer

[cid:image002.jpg@01CB7089.C0069B40]

US: +1 408.540.0001
UK: +44 208.099.1660
India: +91 124.474.8100 | +91 124.410.1350
FAX: +1 408.516.9050
http://www.otssolutions.com

=================================================================================================== Private, Confidential and Privileged. This e-mail and any files and attachments transmitted with it are confidential and/or privileged. They are intended solely for the use of the intended recipient. The content of this e-mail and any file or attachment transmitted with it may have been changed or altered without the consent of the author. If you are not the intended recipient, please note that any review, dissemination, disclosure, alteration, printing, circulation or Transmission of this e-mail and/or any file or attachment transmitted with it, is prohibited and may be unlawful. If you have received this e-mail or any file or attachment transmitted with it in error please notify OTS Solutions at info@otssolutions.com ===================================================================================================

Search Discussions

  • Li Li at Dec 27, 2010 at 2:59 am
    do you mean get tf of the hited documents when doing search?
    it's a difficult problem because only TermScorer has TermDocs and using tf
    in score() function.
    but this we can't know whether these doc is selected because we use a
    priorityQueue in TopScoreDocCollector

    public void collect(int doc) throws IOException {
    float score = scorer.score();

    // This collector cannot handle these scores:
    assert score != Float.NEGATIVE_INFINITY;
    assert !Float.isNaN(score);

    totalHits++;
    if (score <= pqTop.score) {
    // Since docs are returned in-order (i.e., increasing doc Id), a
    document
    // with equal score to pqTop.score cannot compete since HitQueue
    favors
    // documents with lower doc Ids. Therefore reject those docs too.
    return;
    }
    pqTop.doc = doc + docBase;
    pqTop.score = score;
    pqTop = (ScoreDoc) pq.updateTop();
    }
    Because the scorer in BooleanScorer2 is
    ConjuctionScorer(DisjuctionScorer). it can not get tf.
    If you want to do this. you have to modify many codes
    2010/12/25 Ranjit Kumar <Ranjit.Kumar@otssolutions.com>
    Hi,

    Merry Christmas!!



    In case of Boolean query like *’sql AND server’ *.

    I am using parser to get correct document containing both *sql* and *
    server*. Inside for loop in below code I get correct documented and to get
    frequency I need to sum frequency of ‘sql’ and ‘server’ individually with
    the help of *termDocs.read(). *

    As I am searching through millions of document. So, to calculate frequency
    it takes about *160 Second* for *80k* document.



    Is there any way to get frequency of ‘*Boolean query’ *directly without
    manipulation. As it takes lots of time. In case of single term and phrase
    query, I got frequency for 80k document within 10 Seconds.



    *QueryParser parser = new QueryParser(Version.LUCENE_CURRENT, field,
    analyzer);*

    *parser.setDefaultOperator(QueryParser.AND_OPERATOR);*

    *Query query = parser.parse(“sql AND server”);*

    *TopDocs docs = searcher.search(query, null, n);*

    *TermDocs termDocs = reader.termDocs();*

    * termDocs.seek(new Term(field,
    query.toString(field).split(" ")[0].toLowerCase()));*

    * int[] ints = new int[docs.totalHits];*

    * int[] ints1 = new int[docs.totalHits];*

    * termDocs.read(ints, ints1);*

    * List lsts = Arrays.asList(ArrayUtils.toObject(ints));
    *

    * List lsts1 =
    Arrays.asList(ArrayUtils.toObject(ints1));*

    * *

    * termDocs.seek(new Term(field,
    query.toString(field).split(" ")[1].toLowerCase()));*

    * int[] inta = new int[docs.totalHits];*

    * int[] inta1 = new int[docs.totalHits];*

    * termDocs.read(inta, inta1);*

    * List lsta = Arrays.asList(ArrayUtils.toObject(inta));
    *

    * List lsta1 =
    Arrays.asList(ArrayUtils.toObject(inta1));*

    * *

    * int totalFreq = 0;*

    * int docId = -1;*

    * String a = null;*

    * int contId = 0;*

    * String path=null;*

    *for (int i = 0; i < docs2.scoreDocs.length; i++) { *

    * docId = docs2.scoreDocs[i].doc;*

    * path=reader.document(docId).get("path");*

    * a = path.substring(path.lastIndexOf("\\") + 1,
    path.lastIndexOf("."));*

    * try {*

    * if(a.indexOf("e")>-1){*

    * contId =
    Integer.parseInt(a.substring(0, a.length()-1));*

    * }else{*

    * contId = Integer.parseInt(a);*

    * }*

    * } catch (Exception e) {*

    *// e.printStackTrace();*

    * }*

    * *

    * if ((lsts.indexOf(docId) > -1 &&
    lsta.indexOf(docId) > -1) || (lsta.indexOf(docId) > -1) &&
    lsts.indexOf(docId) > -1) {*

    * totalFreq = (Integer)
    lsts1.get(lsts.indexOf(docId)) + (Integer) lsta1.get(lsta.indexOf(docId));
    *

    * } else if (lsts.indexOf(docId) > -1) {*

    * totalFreq = (Integer)
    lsts1.get(lsts.indexOf(docId));*

    * } else if (lsta.indexOf(docId) > -1) {*

    * totalFreq = (Integer)
    lsta1.get(lsta.indexOf(docId));*

    * }*

    * *

    *
    w.write(contId+"\t"+ID+"\t"+totalFreq+"\t"+reader.document(docId).get("path")+"\n");
    *

    * }*





    Thanks & Regards,

    *Ranjit Kumar ***

    Associate Software Engineer



    [image: cid:image002.jpg@01CB7089.C0069B40]



    *US:* +1 408.540.0001

    *UK:* +44 208.099.1660

    *India:* +91 124.474.8100 | +91 124.410.1350

    *FAX:* +1 408.516.9050

    http://www.otssolutions.com


    ===================================================================================================
    Private, Confidential and Privileged. This e-mail and any files and
    attachments transmitted with it are confidential and/or privileged. They are
    intended solely for the use of the intended recipient. The content of this
    e-mail and any file or attachment transmitted with it may have been changed
    or altered without the consent of the author. If you are not the intended
    recipient, please note that any review, dissemination, disclosure,
    alteration, printing, circulation or Transmission of this e-mail and/or any
    file or attachment transmitted with it, is prohibited and may be unlawful.
    If you have received this e-mail or any file or attachment transmitted with
    it in error please notify OTS Solutions at info@otssolutions.com===================================================================================================

Related Discussions

Discussion Navigation
viewthread | post
Discussion Overview
groupjava-user @
categorieslucene
postedDec 25, '10 at 10:21a
activeDec 27, '10 at 2:59a
posts2
users2
websitelucene.apache.org

2 users in discussion

Ranjit Kumar: 1 post Li Li: 1 post

People

Translate

site design / logo © 2022 Grokbase