FAQ
Hi there,
Any ideas you have about the following would be greatly appreciated.

I'd like apostrophes to break a word into two for indexing; i.e., the
French l'observatoire would be indexed as two separate tokens, l and
observatoire. My understanding from reading the documentation and list
archives is that StandardAnalyzer should do this. However, it is not
working that way for me, and l'observatoire is indexed as one word.
Interestingly, l`observatoire (i.e., with the other, less common
apostrophe) is indexed properly, as l observatoire.

Here is the test I've written and the output I'm getting.

TEST


import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.Token;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import java.io.StringReader;
import java.io.IOException;

public class LuceneTokenizingTest {

    static Analyzer analyzer;

    public static void main(String[] args) throws IOException {
        // Empty stop-word list, so no tokens are filtered out.
        analyzer = new StandardAnalyzer(new String[] {});

        testTokenizeUnusualApostrophe();
        testTokenizeUsualApostrophe();
    }

    public static void testTokenizeUnusualApostrophe() {
        System.out.println("Testing how l`observatoire is tokenized.");
        System.out.println("Expecting: [l] [observatoire] ");
        System.out.println("Getting: " + analyze("l`observatoire") + "\n\n");
    }

    public static void testTokenizeUsualApostrophe() {
        System.out.println("Testing how l'observatoire is tokenized.");
        System.out.println("Expecting: [l] [observatoire] ");
        System.out.println("Getting: " + analyze("l'observatoire"));
    }

    // Runs the text through the analyzer and returns each token in brackets.
    private static String analyze(String text) {
        String returnString = "";
        try {
            TokenStream stream = analyzer.tokenStream("contents", new StringReader(text));
            while (true) {
                Token token = stream.next();
                if (token == null) break;
                returnString = returnString + "[" + token.termText() + "] ";
            }
        } catch (IOException e) {
            System.out.println("Exception: " + e.toString());
        }
        return returnString;
    }
}



OUTPUT

Testing how l`observatoire is tokenized.
Expecting: [l] [observatoire]
Getting: [l] [observatoire]


Testing how l'observatoire is tokenized.
Expecting: [l] [observatoire]
Getting: [l'observatoire]




Thanks so much for your help!

Sarah

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


  • Karel Tejnora at Nov 14, 2006 at 1:28 pm
    The apostrophe is recognized as part of a word; StandardAnalyzer is mostly
    English-oriented.
    One way around this is to swap the apostrophes, replacing the "normal"
    one with the unusual one.

    StandardAnalyzer.java, lines 40-44:

    APOSTROPHE:
    token = jj_consume_token(APOSTROPHE);
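
    The apostrophe swap suggested above could be done before the text ever
    reaches the analyzer. What follows is only a sketch under assumptions:
    ApostropheSwappingReader and its swap helper are hypothetical names, not
    part of Lucene, and the replacement character ` is chosen because the test
    output earlier in the thread shows it splitting the word.

    ```java
    import java.io.FilterReader;
    import java.io.IOException;
    import java.io.Reader;
    import java.io.StringReader;
    import java.io.UncheckedIOException;

    // Hypothetical helper (not a Lucene class): rewrites ' to ` while reading.
    class ApostropheSwappingReader extends FilterReader {

        ApostropheSwappingReader(Reader in) {
            super(in);
        }

        @Override
        public int read() throws IOException {
            int c = super.read();
            return c == '\'' ? '`' : c;
        }

        @Override
        public int read(char[] buf, int off, int len) throws IOException {
            int n = super.read(buf, off, len);
            // n is -1 at end of stream, in which case the loop does not run.
            for (int i = off; i < off + n; i++) {
                if (buf[i] == '\'') {
                    buf[i] = '`';
                }
            }
            return n;
        }

        // Convenience for trying the reader without Lucene on the classpath.
        static String swap(String text) {
            StringBuilder out = new StringBuilder();
            try (Reader r = new ApostropheSwappingReader(new StringReader(text))) {
                int c;
                while ((c = r.read()) != -1) {
                    out.append((char) c);
                }
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            }
            return out.toString();
        }

        public static void main(String[] args) {
            System.out.println(swap("l'observatoire")); // prints l`observatoire
        }
    }
    ```

    In the test from the original post, the wrapped reader would be passed in
    place of the plain one: analyzer.tokenStream("contents", new
    ApostropheSwappingReader(new StringReader(text))). Because the swap is one
    character for one character, token offsets stay aligned with the original
    text. Query text would need the same treatment for searches to match.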



  • Sarah Hunter at Nov 14, 2006 at 3:52 pm
    That was my first thought as well, but it looks like APOSTROPHE is
    already the one that I want, as you can see from StandardAnalyzer.jj:

    -------------------
    TOKEN : { // token patterns

    // basic word: a sequence of digits & letters
    <ALPHANUM: (<LETTER>|<DIGIT>|<KOREAN>)+ >

    // internal apostrophes: O'Reilly, you're, O'Reilly's
    // use a post-filter to remove possesives
    <APOSTROPHE: <ALPHA> ("'" <ALPHA>)+ >
    -------------------

    It really looks like it should work for ' rather than `, but it does not.

    Thanks for the reply! Hopefully you or someone else can point out
    what's going on or where I'm going wrong.
    Sarah
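
    One way to read the APOSTROPHE rule quoted above is that it joins the
    letters on both sides of ' into a single token, which would explain the
    [l'observatoire] output, while ` appears in no rule and so acts as a
    separator. The sketch below is a rough regex approximation of the two
    quoted rules, not the JavaCC grammar itself. It assumes <ALPHA> means one
    or more letters, ignores <KOREAN>, and uses a hypothetical class name.

    ```java
    import java.util.ArrayList;
    import java.util.List;
    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    // Hypothetical approximation of the APOSTROPHE and ALPHANUM rules.
    class ApostropheRuleSketch {

        // First alternative: letters, then one or more ('letters) groups.
        // Second alternative: a plain run of letters and digits.
        // The apostrophe form is listed first so it wins when it applies.
        private static final Pattern TOKEN =
                Pattern.compile("\\p{L}+(?:'\\p{L}+)+|[\\p{L}\\p{N}]+");

        static List<String> tokenize(String text) {
            List<String> tokens = new ArrayList<>();
            Matcher m = TOKEN.matcher(text);
            while (m.find()) {
                tokens.add(m.group());
            }
            return tokens;
        }

        public static void main(String[] args) {
            System.out.println(tokenize("l'observatoire")); // [l'observatoire]
            System.out.println(tokenize("l`observatoire")); // [l, observatoire]
        }
    }
    ```

    Under this reading the rule is behaving as written: it was meant to keep
    words such as O'Reilly whole, and l'observatoire has the same shape.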
  • Karel Tejnora at Nov 14, 2006 at 9:06 pm
    The problem is in StandardTokenizer, so one option is an Analyzer whose
    tokenStream method avoids it:

    public TokenStream tokenStream(String fieldName, Reader reader) {
        TokenStream result = new LowerCaseTokenizer(reader);
        result = new StopFilter(result, stopSet);
        return result;
    }

    If you need everything StandardAnalyzer does, edit StandardTokenizer.jj
    and remove the APOSTROPHE alternative from the token definition:

    176 token = <APOSTROPHE> |

    Then recompile. l'aabb should then be recognized as <ALPHA> <ALPHA>.
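
    To see what the LowerCaseTokenizer route would produce without building
    an index, its splitting behavior can be imitated in plain Java. This is a
    sketch under assumptions: LetterSplitSketch is a hypothetical name, and
    the loop only imitates the documented behavior of LowerCaseTokenizer
    (tokens are maximal runs of letters, lowercased). It does not call Lucene
    and leaves out the StopFilter step.

    ```java
    import java.util.ArrayList;
    import java.util.List;

    // Hypothetical stand-in for letter-based tokenizing with lowercasing.
    class LetterSplitSketch {

        static List<String> tokenize(String text) {
            List<String> tokens = new ArrayList<>();
            StringBuilder current = new StringBuilder();
            for (int i = 0; i < text.length(); i++) {
                char c = text.charAt(i);
                if (Character.isLetter(c)) {
                    current.append(Character.toLowerCase(c));
                } else if (current.length() > 0) {
                    // Any non-letter (apostrophe, digit, space) ends the token.
                    tokens.add(current.toString());
                    current.setLength(0);
                }
            }
            if (current.length() > 0) {
                tokens.add(current.toString());
            }
            return tokens;
        }

        public static void main(String[] args) {
            // prints [l, observatoire, de, paris]
            System.out.println(tokenize("L'Observatoire de Paris 2006"));
        }
    }
    ```

    The trade-off shows in the example: the apostrophe splits the word as
    wanted, but digits are dropped as well, which StandardAnalyzer would keep.
    Editing the grammar avoids that at the cost of maintaining a patched
    tokenizer.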


Discussion Overview
group: java-user
categories: lucene
posted: Nov 13, '06 at 9:51p
active: Nov 14, '06 at 9:06p
posts: 4
users: 2
website: lucene.apache.org
