I recently upgraded to Lucene 3.0 and am seeing some new behavior that I don't understand. Perhaps someone can explain why.



I have a custom analyzer. Part of the analyzer uses the ASCIIFoldingFilter. If I run a word with an umlaut through that analyzer using the AnalyzerDemo code in LIA2, I get the same word except that the umlauted letter is now a plain ASCII letter (no umlaut). That's what I expect and want.



If I create a QueryParser using the call new QueryParser(LUCENE_30, "body", myAnalyzer) and then call the parse() method, passing the same word, I can see that the query parser has not removed the umlaut. The string it has is "+body: Europabörsen".



I know I had to make a number of changes to the analyzer and the tokenizer to upgrade to 3.x. Is there something very different from the 2.x version that I'm likely missing?



Anyone have any thoughts?


  • Simon Willnauer at Sep 17, 2010 at 7:03 am

    This seems to be an issue with your analyzer rather than with the
    QueryParser, since QueryParser didn't really change its behavior in
    3.0 apart from some default values. Can you provide more info on what
    you did with your analyzer? Did you try running the term with umlaut
    chars through your Analyzer / TokenStream directly? Something like this:

    Analyzer a = new MyAnalyzer();
    TokenStream stream = a.reusableTokenStream("body", new StringReader("Europabörsen"));
    TermAttribute attr = stream.addAttribute(TermAttribute.class);
    while (stream.incrementToken()) {
        System.out.println(attr.term());
    }

    simon

    ---------------------------------------------------------------------
    To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
    For additional commands, e-mail: java-user-help@lucene.apache.org
  • Scott Smith at Sep 17, 2010 at 5:34 pm
    First, let me say that I didn't think the problem was in QueryParser, and I apologize if that's how it sounded. QueryParser is a central part of Lucene. One of me is having problems with QueryParser while thousands of others are not, so is the problem more likely in my code or in Lucene? We'll all agree on the answer to that question.

    As further proof, I ran the following code. The first part is from Simon's email (thanks for that snippet) and the second part is from LIA2.

    // code from Willnauer email
    Analyzer a = new MyAnalyzer(Version.LUCENE_30);
    TokenStream stream = a.reusableTokenStream("body", new StringReader("Europabörsen"));
    TermAttribute attr = stream.addAttribute(TermAttribute.class);
    while (stream.incrementToken())
    {
        System.out.println(attr.term());
    }

    // code from LIA2
    stream = a.tokenStream("body", new StringReader("Europabörsen"));
    TermAttribute term = stream.addAttribute(TermAttribute.class);
    while (stream.incrementToken())
    {
        System.out.print(term.term());
    }


    The answer I got back was:
    europabörsen
    europaborsen

    I realized the difference between these two was whether I was getting the reusableTokenStream or the tokenStream. Looking at my code, the ASCIIFoldingFilter was in the filter chain set up by tokenStream() but not in the one set up by reusableTokenStream(). I added it to reusableTokenStream() and I now get the result I wanted: the above code snippet prints the word without the umlaut in both cases. So, problem solved.
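    A mismatch like this can be caught by feeding the same text through both entry points and comparing the results. The sketch below is a library-free stand-in and not Lucene code: SketchAnalyzer, its two methods, and the fold helper are hypothetical names, and java.text.Normalizer is only a rough substitute for ASCIIFoldingFilter (it strips combining marks and nothing more).

```java
import java.text.Normalizer;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

public class SketchAnalyzer {
    // When false, the reusable path omits folding, mimicking the bug in the thread.
    private final boolean foldInReusablePath;

    public SketchAnalyzer(boolean foldInReusablePath) {
        this.foldInReusablePath = foldInReusablePath;
    }

    // Rough stand-in for ASCII folding: decompose, then drop combining marks.
    static String fold(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFD).replaceAll("\\p{M}", "");
    }

    // Whitespace tokenizer + lowercase + optional folding.
    private static List<String> analyze(String text, boolean fold) {
        List<String> tokens = new ArrayList<String>();
        for (String raw : text.trim().split("\\s+")) {
            if (raw.isEmpty()) {
                continue;
            }
            String t = raw.toLowerCase(Locale.ROOT);
            tokens.add(fold ? fold(t) : t);
        }
        return tokens;
    }

    public List<String> tokenStream(String text) {
        return analyze(text, true);
    }

    public List<String> reusableTokenStream(String text) {
        return analyze(text, foldInReusablePath);
    }

    public static void main(String[] args) {
        String word = "Europab\u00f6rsen";
        SketchAnalyzer broken = new SketchAnalyzer(false);
        SketchAnalyzer fixed = new SketchAnalyzer(true);
        // Broken setup: the two paths disagree, so this prints false.
        System.out.println(broken.tokenStream(word).equals(broken.reusableTokenStream(word)));
        // Fixed setup: both paths fold, so this prints true.
        System.out.println(fixed.tokenStream(word).equals(fixed.reusableTokenStream(word)));
    }
}
```

    With a real analyzer the same comparison would collect the terms from tokenStream() and reusableTokenStream() and assert that the two lists are equal.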

    Thanks to Simon for putting me on the right track.

    Scott


  • Simon Willnauer at Sep 17, 2010 at 7:31 pm

    Don't worry :)

    Are you using Lucene 3.0? If so, take a look at ReusableAnalyzerBase,
    which makes it much easier to build Analyzers and prevents code
    duplication.
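
    The design idea behind that suggestion can be sketched without Lucene: a base class owns both entry points, and a subclass describes its filter chain exactly once, so the two paths cannot drift apart. Everything below (SingleChainAnalyzer, buildChain, FoldingAnalyzer) is hypothetical and uses only the JDK; it is an analogy for the pattern, not ReusableAnalyzerBase's actual API, and the Normalizer-based folding is again only a rough stand-in for ASCIIFoldingFilter.

```java
import java.text.Normalizer;
import java.util.Locale;
import java.util.function.UnaryOperator;

public class SingleChainDemo {

    // Hypothetical base class: both entry points are final and share one chain definition.
    abstract static class SingleChainAnalyzer {
        private UnaryOperator<String> cached; // chain kept for reuse, built lazily

        // The only place a subclass describes its filter chain.
        protected abstract UnaryOperator<String> buildChain();

        // Fresh chain on every call.
        public final String tokenStream(String term) {
            return buildChain().apply(term);
        }

        // Same chain definition, but built once and reused.
        public final String reusableTokenStream(String term) {
            if (cached == null) {
                cached = buildChain();
            }
            return cached.apply(term);
        }
    }

    // Lowercase, then strip combining marks after NFD decomposition.
    static final class FoldingAnalyzer extends SingleChainAnalyzer {
        @Override
        protected UnaryOperator<String> buildChain() {
            return term -> Normalizer
                    .normalize(term.toLowerCase(Locale.ROOT), Normalizer.Form.NFD)
                    .replaceAll("\\p{M}", "");
        }
    }

    public static void main(String[] args) {
        FoldingAnalyzer a = new FoldingAnalyzer();
        System.out.println(a.tokenStream("Europab\u00f6rsen"));         // europaborsen
        System.out.println(a.reusableTokenStream("Europab\u00f6rsen")); // europaborsen
    }
}
```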

    simon

Discussion Overview
group: java-user
category: lucene
posted: Sep 16, '10 at 11:07p
active: Sep 17, '10 at 7:31p
posts: 4
users: 2
website: lucene.apache.org
