Does the QueryParser class really use the Analyzer passed to the parse
method?

I looked at the code and I don't see the object being used anywhere in
the class. The problem is that I am writing an application with Lucene
that searches in a foreign language with Latin characters. Indexing
works fine, but the search apparently doesn't call the Analyzer.

Here is an example:
I have a file that contains the following word: memória
If I search for: memoria (without the accent on the o), it finds the
word, which is correct.
If I search for: memória (the exact same word), it doesn't find the
word, because the QueryParser splits the word into "mem ria". If the
analyzer were called, the "ó" would be replaced with "o". I guess the
analyzer isn't called; is this right?

Thanks in advance,
Ricardo Lopes

---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-dev-unsubscribe@jakarta.apache.org
For additional commands, e-mail: lucene-dev-help@jakarta.apache.org


  • Erik Hatcher at Nov 30, 2004 at 6:04 pm

    On Nov 30, 2004, at 10:42 AM, Ricardo Lopes wrote:
    > Does the QueryParser class really use the Analyzer passed to the
    > parse method?

    Absolutely.

    > I looked at the code and I don't see the object being used anywhere
    > in the class. The problem is that I am writing an application with
    > Lucene that searches in a foreign language with Latin characters.
    > Indexing works fine, but the search apparently doesn't call the
    > Analyzer.

    Look at the getFieldQuery method. It uses the analyzer to extract the
    tokens from each part of the query (phrases and stand-alone terms).

    > Here is an example:
    > I have a file that contains the following word: memória
    > If I search for: memoria (without the accent on the o), it finds
    > the word, which is correct.
    > If I search for: memória (the exact same word), it doesn't find the
    > word, because the QueryParser splits the word into "mem ria". If the
    > analyzer were called, the "ó" would be replaced with "o". I guess
    > the analyzer isn't called; is this right?

    What Analyzer are you using? My guess is that your analyzer is what
    did the splitting, though it could be something fishy in how you got
    the string into QueryParser in the first place.

    Erik


  • Ricardo Lopes at Nov 30, 2004 at 7:30 pm
    I was using an adaptation of the SearchFiles class distributed in the
    demo (demo.org.apache.lucene.demo.SearchFiles).
    The Analyzer is the BrazilianAnalyzer available in the sandbox
    (org.apache.lucene.analysis.br.BrazilianAnalyzer).

    > My guess is that your analyzer is what did the splitting

    After looking at the code more carefully, I found that the tokenStream
    method in BrazilianAnalyzer calls the StandardTokenizer, and this is
    what splits the search string. Is there a simple way to subclass the
    tokenizer to avoid splitting on those characters, or do I have to
    write a custom implementation of that class?

    > though it could be something fishy in how you got the string into
    > QueryParser in the first place?

    Since this only happens when I search (during indexing the splitting
    of those characters doesn't happen), I thought that it had to do with
    the QueryParser, but it seems that the problem is with the
    StandardTokenizer.

    Thanks

  • Erik Hatcher at Nov 30, 2004 at 7:59 pm

    On Nov 30, 2004, at 2:29 PM, Ricardo Lopes wrote:
    >> My guess is that your analyzer is what did the splitting
    >
    > After looking at the code more carefully, I found that the
    > tokenStream method in BrazilianAnalyzer calls the
    > StandardTokenizer, and this is what splits the search string. Is
    > there a simple way to subclass the tokenizer to avoid splitting on
    > those characters, or do I have to write a custom implementation of
    > that class?

    You can verify this by using the AnalysisDemo referenced here:

    http://wiki.apache.org/jakarta-lucene/AnalysisParalysis

    Or use Luke - http://www.getopt.org/luke/ - which has a nice plugin
    page that can do this type of analysis inspection (you'll have to add
    the sandbox analyzer JAR to the classpath when launching Luke).

    As for subclassing StandardTokenizer - no, you won't have much luck
    there. StandardTokenizer is a JavaCC-based tokenizer and is not
    designed for subclassing to control this sort of thing.

    > Since this only happens when I search (during indexing the
    > splitting of those characters doesn't happen)

    Are you sure that splitting is not happening during indexing? If the
    AnalysisDemo (or Luke) run on your string splits it, then it is
    splitting at indexing time too. Keep in mind that looking at a field's
    value shows you the stored *original* value, not the tokenized values.

    > I thought that it had to do with the QueryParser, but it seems that
    > the problem is with the StandardTokenizer.

    I'm not sure - I haven't tried that string with the analyzer you
    provided. If it was StandardTokenizer and you're using the same
    analyzer for indexing and searching, you'd have the values split in
    both places - which is actually fine, as searches would match what was
    indexed :)

    Erik


  • Ricardo Lopes at Dec 2, 2004 at 5:29 pm
    I tried Luke and it is great; I don't like the code, but the tool is
    really good.

    I found the problem, but I don't understand why it happens.

    This was the old code (it doesn't work):

    --------//-----------------
    IndexSearcher searcher = new IndexSearcher("data");
    BrazilianAnalyzer analyzer = new BrazilianAnalyzer();

    // begin of code block that doesn't work
    BufferedReader in = new BufferedReader(new InputStreamReader(System.in));
    System.out.print("Query: ");
    String line = in.readLine();
    // end of code that doesn't work

    Query query = QueryParser.parse(line, "contents", analyzer);
    Hits hits = searcher.search(query);
    System.out.println(hits.length() + " total matching documents");

    -------------//------------
    But if I replace the code block that doesn't work with this line, it works fine:

    String line = "text to search";

    This doesn't have anything to do with Lucene, but why does it work if
    I put the text directly into the string, and not when I read it from
    the input stream?
    Does it have something to do with the encoding or something like that?
    Is it a problem with the Windows shell passing the accented characters
    incorrectly?

    Thanks for your help

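The thread ends without an answer to that last question. One plausible explanation, which nobody in the thread confirmed, is a charset mismatch: `new InputStreamReader(System.in)` decodes bytes with the JVM's default charset, while a Windows console may send them in a different code page. In code page 850, ó is the byte 0xA2, and that byte decodes to ¢ in ISO-8859-1 (and windows-1252), so a tokenizer that splits on non-letters would produce "mem" and "ria". The sketch below illustrates this with the JDK only. The code page is an assumption about the poster's machine, and `analyze`, `decode` and `readLine` are hypothetical helpers; `analyze` is a rough stand-in, not Lucene's StandardTokenizer or BrazilianAnalyzer.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.text.Normalizer;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

public class ConsoleEncodingSketch {

    // Hypothetical stand-in for an analyzer: splits on anything that is not a
    // letter or digit, lowercases, and strips accents. Not Lucene code.
    static List<String> analyze(String text) {
        List<String> terms = new ArrayList<>();
        for (String piece : text.split("[^\\p{L}\\p{N}]+")) {
            if (piece.isEmpty()) {
                continue; // split() yields an empty piece for a leading separator
            }
            String decomposed = Normalizer.normalize(
                    piece.toLowerCase(Locale.ROOT), Normalizer.Form.NFD);
            terms.add(decomposed.replaceAll("\\p{M}", "")); // drop combining marks
        }
        return terms;
    }

    // Decodes raw console bytes with the given charset.
    static String decode(byte[] consoleBytes, Charset charset) {
        return new String(consoleBytes, charset);
    }

    // Reads one line from a stream with an explicit charset instead of the
    // JVM default that new InputStreamReader(in) would use.
    static String readLine(InputStream in, Charset charset) throws IOException {
        BufferedReader reader = new BufferedReader(new InputStreamReader(in, charset));
        return reader.readLine();
    }

    public static void main(String[] args) throws IOException {
        // Assumption: the console sends "memória" in code page 850, where ó is 0xA2.
        byte[] typed = {'m', 'e', 'm', (byte) 0xA2, 'r', 'i', 'a'};

        // Decoding those bytes as ISO-8859-1 turns 0xA2 into the cent sign,
        // which is not a letter, so the word is split in two: [mem, ria]
        String misread = decode(typed, StandardCharsets.ISO_8859_1);
        System.out.println(analyze(misread));

        // A correctly decoded string stays whole and is folded to: [memoria]
        System.out.println(analyze("mem\u00f3ria"));

        // Reading with the charset the bytes were really written in avoids the
        // problem; UTF-8 is used here only so the example is self-contained.
        InputStream in = new ByteArrayInputStream(
                "mem\u00f3ria\n".getBytes(StandardCharsets.UTF_8));
        System.out.println(analyze(readLine(in, StandardCharsets.UTF_8)));
    }
}
```

If this is the cause, passing the console's actual charset to `InputStreamReader` (for example `Charset.forName("Cp850")`, where that charset is installed) should make the typed query behave like the hard-coded string. Printing the code points of `line` before parsing is a quick way to check what was actually read.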

Discussion Overview
group: dev @
categories: lucene
posted: Nov 30, '04 at 3:43p
active: Dec 2, '04 at 5:29p
posts: 5
users: 2
website: lucene.apache.org
