I don't think there is a simpler way; I think you will have to modify
the tokenizer. Once you go beyond basic human-readable text, you always
end up having to do that. I have modified the JavaCC version of
StandardTokenizer to let symbols pass through, but I've never used the
JFlex version, so I'm afraid I don't know anything about JFlex.
A good strategy might be to define a new lexical token type called
"SYMBOL" and try to catch as many symbols as you can think of; then
maybe create further token types, which are ALPHANUM types that can
have symbols as a prefix or suffix.
That way, you'll be able to catch things like "c++" in a TokenFilter,
and you can choose to pass it through as a single token, or split it up
into two tokens, or whatever you want.
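
To make the idea concrete, here is a rough sketch in plain Java. It is
not Lucene code and not the actual StandardTokenizer grammar: the class
name, the ALPHANUM_SUFFIXED type, the regular expression, and the choice
of "+" and "#" as the only recognized symbols are all assumptions made
for illustration. It only shows the two steps: type the tokens while
lexing, then decide in a later filter-like step whether to keep "c++"
whole or split it.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class SymbolTokenizerSketch {

    // Hypothetical token types; ALPHANUM_SUFFIXED is an invented name.
    enum Type { ALPHANUM, ALPHANUM_SUFFIXED, SYMBOL }

    static final class Token {
        final String text;
        final Type type;
        Token(String text, Type type) { this.text = text; this.type = type; }
        @Override public String toString() { return type + ":" + text; }
    }

    // Alternatives are tried left to right, so "c++" matches the suffixed
    // form before plain ALPHANUM gets a chance to match just "c".
    // Assumption: only '+' and '#' count as symbols here.
    private static final Pattern TOKEN = Pattern.compile(
        "(?<suffixed>[\\p{L}\\p{N}]+[+#]+)"
        + "|(?<alphanum>[\\p{L}\\p{N}]+)"
        + "|(?<symbol>[+#]+)");

    // Lexing step: lowercase the input and emit typed tokens.
    static List<Token> tokenize(String input) {
        List<Token> out = new ArrayList<>();
        Matcher m = TOKEN.matcher(input.toLowerCase(Locale.ROOT));
        while (m.find()) {
            Type type = m.group("suffixed") != null ? Type.ALPHANUM_SUFFIXED
                      : m.group("alphanum") != null ? Type.ALPHANUM
                      : Type.SYMBOL;
            out.add(new Token(m.group(), type));
        }
        return out;
    }

    // Filter-like step: split each suffixed token into ALPHANUM + SYMBOL.
    // Skipping this step passes "c++" through as a single token.
    static List<Token> splitSuffixed(List<Token> in) {
        List<Token> out = new ArrayList<>();
        for (Token t : in) {
            if (t.type != Type.ALPHANUM_SUFFIXED) {
                out.add(t);
                continue;
            }
            // A suffixed token always ends in '+' or '#', so this scan stops.
            int i = 0;
            while (t.text.charAt(i) != '+' && t.text.charAt(i) != '#') {
                i++;
            }
            out.add(new Token(t.text.substring(0, i), Type.ALPHANUM));
            out.add(new Token(t.text.substring(i), Type.SYMBOL));
        }
        return out;
    }

    public static void main(String[] args) {
        List<Token> tokens = tokenize("I know C++ and C#");
        System.out.println(tokens);                 // suffixed tokens kept whole
        System.out.println(splitSuffixed(tokens));  // suffixed tokens split in two
    }
}
```

In a real analyzer the typing would happen in the tokenizer grammar and
the keep-or-split decision in a TokenFilter that looks at the token
type; the sketch just keeps both steps in one small class.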
Hope that helps.
Regards,
JB
Alex Soto wrote:
Hello:
I have a problem where I need to search for the term "C++".
If I use StandardAnalyzer, the "+" characters are removed and the
search is done on just the "c" character, which is not what is
intended.
Yet I need to use StandardAnalyzer for the other benefits it provides.
I think I need to write a specialized tokenizer (and accompanying
analyzer) that lets the "+" characters pass.
I would take the JFlex-based one, modify it, and add it to my project.
My question is:
Is there any simpler way to accomplish the same?
Best regards,
Alex Soto
lexsoto@gmail.com
-
Amicus Plato, sed magis amica veritas.
---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org