FAQ
Hi, I am new to Lucene. I have a query string of multiple terms.
i) I want to get back the query string with stop words removed and the remaining terms stemmed.
ii) I also want to get the tf and idf of each term in the query. How can I get them?

Asif

  • Phan The Dai at Feb 2, 2010 at 8:00 pm
    In my opinion, you can do all of this with BooleanQuery.

  • Java8964 java8964 at Feb 2, 2010 at 8:56 pm
    Hi, I have the following test case, which points at the index generated by our application. The result confuses me, and I don't know the reason for it.

    Lucene version: 2.9.0
    JDK 1.6.0_18

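    import java.io.File;
    import org.apache.lucene.analysis.KeywordAnalyzer;
    import org.apache.lucene.analysis.PerFieldAnalyzerWrapper;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.queryParser.QueryParser;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.Query;
    import org.apache.lucene.store.FSDirectory;
    import org.apache.lucene.util.Version;

    public class IndexTest1 {
        public static void main(String[] args) {
            try {
                FSDirectory directory = FSDirectory.open(new File("/path_to_index_files"));
                // Open the index read-only
                IndexSearcher searcher = new IndexSearcher(directory, true);
                // Note: 'wrapper' is built here but never passed to QueryParser below
                PerFieldAnalyzerWrapper wrapper = new PerFieldAnalyzerWrapper(new StandardAnalyzer());
                wrapper.addAnalyzer("f1string_sif", new KeywordAnalyzer());
                wrapper.addAnalyzer("f2string_ti", new StandardAnalyzer(Version.LUCENE_CURRENT));
                Query query = new QueryParser("f1string_sif", new StandardAnalyzer(Version.LUCENE_CURRENT)).parse("f2string_ti:subbank*");
                System.out.println("query = " + query);
                System.out.println("hits = " + searcher.search(query, 100).totalHits);
                searcher.close();
            } catch (Exception e) {
                System.out.println(e);
            }
        }
    }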

    Output:
    query = f2string_ti:subbank*
    hits = 6

    If I change the line to the following:

    Query query = new QueryParser("f1string_sif", new StandardAnalyzer(Version.LUCENE_CURRENT)).parse("f2string_ti:rdmap*");

    Output:
    query = f2string_ti:rdmap*
    hits = 4

    Both of the above results are correct for my data.

    Now if I change the line to:

    Query query = new QueryParser("f1string_sif", new StandardAnalyzer(Version.LUCENE_CURRENT)).parse("f2string_ti:subbank* OR f2string_ti:rdmap*");

    Output:
    query = f2string_ti:subbank* f2string_ti:rdmap*
    hits = 2


    I would expect the count for the last query to be at least max(6, 4), but it is 2. Is there a reason for that?

    Thanks
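
The expectation in the post above follows from set semantics: an OR query matches the union of the documents matched by each clause, so its hit count should lie between max(6, 4) and 6 + 4. A minimal sketch in plain Java, with made-up document ids standing in for the real index (the ids and the class name OrHitCount are hypothetical, and no Lucene API is involved):

```java
import java.util.Arrays;
import java.util.Set;
import java.util.TreeSet;

// Illustration only: models query results as sets of document ids.
class OrHitCount {
    // Documents matching "a OR b" are the union of both result sets.
    static Set<Integer> or(Set<Integer> a, Set<Integer> b) {
        Set<Integer> union = new TreeSet<>(a);
        union.addAll(b);
        return union;
    }

    public static void main(String[] args) {
        // Hypothetical ids: 6 hits for one clause, 4 for the other, 2 shared.
        Set<Integer> subbank = new TreeSet<>(Arrays.asList(1, 2, 3, 4, 5, 6));
        Set<Integer> rdmap = new TreeSet<>(Arrays.asList(5, 6, 7, 8));
        int hits = or(subbank, rdmap).size();
        System.out.println("hits = " + hits); // prints "hits = 8"
    }
}
```

With these sample sets the union has 8 members. A count of 2 would match the size of the intersection here, not the union, which is why the reported result looks wrong.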


  • Ian Lea at Feb 3, 2010 at 10:03 am
    You should probably be using your PerFieldAnalyzerWrapper in your
    calls to QueryParser but apart from that I can't see any obvious
    reason. General advice: use Luke to check what has been indexed and
    read http://wiki.apache.org/lucene-java/LuceneFAQ#Why_am_I_getting_no_hits_.2BAC8_incorrect_hits.3F

    If none of these help, post again but showing what you are indexing as
    well as how you are searching - the smallest possible test case or
    self-contained program that shows the problem.

    Or maybe someone else will spot the problem.


    --
    Ian.


    ---------------------------------------------------------------------
    To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
    For additional commands, e-mail: java-user-help@lucene.apache.org
  • Java8964 java8964 at Feb 3, 2010 at 3:08 pm
    Thanks for your help.

    I upgraded Lucene to 2.9.1 and the problem is gone. It looks like a BooleanQuery bug in Lucene 2.9.0 that was fixed in 2.9.1.

    Thanks
  • Asif Nawaz at Feb 3, 2010 at 11:00 am
    In the Lucene HotelDatabase example project, the following code appears in the performSearch method of the SearchEngine class.

    Let queryString = "Located in the heart of paris"

    Analyzer analyzer = new StandardAnalyzer();
    IndexSearcher is = new IndexSearcher("index");
    QueryParser parser = new QueryParser("content", analyzer);
    Query query = parser.parse(queryString);
    Hits hits = is.search(query);

    To be specific, this is what I want:
    i) Remove stop words from the query string and apply stemming, so that the new query string becomes "Locate heart paris".
    ii) How do I get the term frequency (tf) of each word in the query?
    iii) How do I get the document frequency (df) of each word in the query?
    iv) How do I get the inverse document frequency (idf) of each word in the query?


    Can you please point me to a solution that answers all four questions, or refer me to some sample code? I have tried BooleanQuery but was unable to do this with it.
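
For points i) and ii), here is a rough sketch in plain Java, without Lucene. It assumes a tiny hypothetical stop list and leaves stemming out, since a real stemmer (for example Lucene's PorterStemFilter) does not fit in a few lines. QueryTerms, STOP_WORDS, tokens and termFrequencies are names made up for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

class QueryTerms {
    // Hypothetical, deliberately tiny stop list; real analyzers ship longer ones.
    static final Set<String> STOP_WORDS =
            new HashSet<>(Arrays.asList("a", "an", "in", "of", "the", "to"));

    // Lowercase, split on anything that is not a letter or digit, drop stop words.
    static List<String> tokens(String query) {
        List<String> kept = new ArrayList<>();
        for (String t : query.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (!t.isEmpty() && !STOP_WORDS.contains(t)) {
                kept.add(t);
            }
        }
        return kept;
    }

    // Term frequency within the query itself: how often each kept term occurs.
    static Map<String, Integer> termFrequencies(List<String> tokens) {
        Map<String, Integer> tf = new LinkedHashMap<>();
        for (String t : tokens) {
            tf.merge(t, 1, Integer::sum);
        }
        return tf;
    }

    public static void main(String[] args) {
        List<String> kept = tokens("Located in the heart of paris");
        System.out.println(kept);                  // [located, heart, paris]
        System.out.println(termFrequencies(kept)); // {located=1, heart=1, paris=1}
    }
}
```

This counts tf inside the query string only. The tf of a term inside an indexed document has to come from the index itself, which this sketch does not touch.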



  • Sethu_424 at May 26, 2010 at 2:15 pm
    Hi,
    I am not sure whether you are still looking for an answer to your question.
    If so, please read on.

    You can get the DF and IDF of each term in the query as shown below.

    IndexReader reader = IndexReader.open(FSDirectory.open(new File(indexDir)), true);

    // Create a FilterIndexReader to invoke the abstract methods
    FilterIndexReader filterIndexReader = new FilterIndexReader(reader);

    // Number of documents in the index
    int numDocs = filterIndexReader.numDocs();

    // Iterate over each of the query words
    for (String queryWord : queryWords) {
        Term term = new Term(searchField, queryWord.toLowerCase());

        int docFreq = 0;
        try {
            docFreq = filterIndexReader.docFreq(term);
        } catch (IOException e) {
            logger.log(Level.SEVERE, null, e);
        }

        // Calculate IDF
        double idf = 0.0;
        if (docFreq > 0) {
            idf = Math.log10((double) numDocs / docFreq);
        }

        System.out.println(queryWord + "\tDF -" + docFreq + "\tIDF -" + idf);
    }
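
The calculation in the snippet above can be checked without an index. A small sketch, assuming the textbook definition idf = log10(N / df), where N is the total number of documents in the index and df is the number of documents containing the term. The second method follows the smoothed form that, as far as I know, Lucene's DefaultSimilarity uses: 1 + ln(N / (df + 1)). The class name IdfSketch and the sample counts are made up.

```java
class IdfSketch {
    // Textbook idf: log10(N / df); defined as 0 when the term occurs in no document.
    static double idf(int numDocs, int docFreq) {
        return docFreq > 0 ? Math.log10((double) numDocs / docFreq) : 0.0;
    }

    // Smoothed variant in the style of Lucene's DefaultSimilarity: 1 + ln(N / (df + 1)).
    static double smoothedIdf(int numDocs, int docFreq) {
        return 1.0 + Math.log((double) numDocs / (docFreq + 1));
    }

    public static void main(String[] args) {
        int numDocs = 1000; // hypothetical index size
        for (int docFreq : new int[] {1, 10, 100, 1000}) {
            System.out.println("df=" + docFreq
                    + "\tidf=" + idf(numDocs, docFreq)
                    + "\tsmoothed=" + smoothedIdf(numDocs, docFreq));
        }
    }
}
```

Rare terms get a high idf and terms that occur in every document get an idf of 0 under the textbook form, which is the behaviour a ranking weight is meant to have.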

  • Yura.minsk at May 20, 2012 at 10:31 am
    Sethu_424 wrote:
        int numDocs = filterIndexReader.numDocs();
        ...
        idf = Math.log10((double) numDocs / docFreq);

    Wrong formula. numDocs should not be the count of documents in the index, but
    the count of documents containing the search term.
    We need something like IndexReader.docFreq(term);


Discussion Overview
group: java-user
categories: lucene
posted: Feb 1, '10 at 1:44p
active: May 20, '12 at 10:31a
posts: 8
users: 6
website: lucene.apache.org
