On Thu, Jul 30, 2009 at 7:12 PM, wrote:
I was wondering if there is a list of special characters for the standard analyzer?
What I mean by "special" is characters that the analyzer considers break characters.
For example, if I have something like "foo=something", apparently the analyzer
considers this as two terms, "foo" and "something".
Hi Jim,
This is what I could find in the docs...
StandardAnalyzer uses StandardTokenizer:
http://lucene.apache.org/java/2_4_0/api/org/apache/lucene/analysis/standard/StandardTokenizer.html
* Splits words at punctuation characters, removing punctuation.
However, a dot that's not followed by whitespace is considered part of
a token.
* Splits words at hyphens, unless there's a number in the token, in
which case the whole token is interpreted as a product number and is
not split.
* Recognizes email addresses and internet hostnames as one token.
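
To make the first rule concrete for the "foo=something" case, here is a rough
sketch in plain Java. It is not Lucene code and not the real StandardTokenizer
grammar: the class and method names are made up for illustration, and it only
imitates splitting at punctuation plus the dot exception (approximated here as
"a dot followed by a letter or digit"). The hyphen, email, and hostname rules
are ignored.

```java
import java.util.ArrayList;
import java.util.List;

public class PunctuationSplitSketch {

    // Hypothetical helper, not part of Lucene: breaks text at any character
    // that is not a letter or digit and drops that character.
    static List<String> split(String text) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            // Keep a dot only when it sits inside a token and is directly
            // followed by a letter or digit (assumed simplification).
            boolean dotInsideToken = c == '.'
                    && current.length() > 0
                    && i + 1 < text.length()
                    && Character.isLetterOrDigit(text.charAt(i + 1));
            if (Character.isLetterOrDigit(c) || dotInsideToken) {
                current.append(c);
            } else if (current.length() > 0) {
                // Any other character ends the current token.
                tokens.add(current.toString());
                current.setLength(0);
            }
        }
        if (current.length() > 0) {
            tokens.add(current.toString());
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(split("foo=something"));         // prints [foo, something]
        System.out.println(split("see example.com. Done")); // prints [see, example.com, Done]
    }
}
```

So "=" behaves like any other punctuation character here: it ends the current
token and is thrown away. To see what the real analyzer does with your data,
run your text through StandardAnalyzer itself and look at the tokens it emits.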
Also, these are the stop words that will be removed:
public static final String[] ENGLISH_STOP_WORDS = {
    "a", "an", "and", "are", "as", "at", "be", "but", "by",
    "for", "if", "in", "into", "is", "it",
    "no", "not", "of", "on", "or", "such",
    "that", "the", "their", "then", "there", "these",
    "they", "this", "to", "was", "will", "with"
};
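
The effect of that list can be sketched in plain Java as well. Again, this is
only an illustration and not how Lucene implements it: the class and method
names are hypothetical, and the sketch assumes the tokens have already been
lowercased, since it compares them to the list by exact match.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class StopWordSketch {

    // Same words as the ENGLISH_STOP_WORDS array quoted above.
    static final Set<String> STOP_WORDS = new HashSet<>(Arrays.asList(
        "a", "an", "and", "are", "as", "at", "be", "but", "by",
        "for", "if", "in", "into", "is", "it",
        "no", "not", "of", "on", "or", "such",
        "that", "the", "their", "then", "there", "these",
        "they", "this", "to", "was", "will", "with"));

    // Hypothetical helper: returns the tokens that are not stop words,
    // preserving their original order.
    static List<String> removeStopWords(List<String> tokens) {
        List<String> kept = new ArrayList<>();
        for (String token : tokens) {
            if (!STOP_WORDS.contains(token)) {
                kept.add(token);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<String> tokens = Arrays.asList("the", "value", "of", "foo");
        System.out.println(removeStopWords(tokens)); // prints [value, foo]
    }
}
```

In other words, a term like "the" or "of" never reaches the index, so a query
for one of those words on its own will not match anything.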
Thanks,
Phil