Hi,

I was wondering whether there is a list of special characters for the standard analyzer.

By "special" I mean characters that the analyzer treats as break characters. For example, if I have something like "foo=something", the analyzer apparently treats it as two terms, "foo" and "something".

Thanks,
Jim

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

  • Phil Whelan at Jul 31, 2009 at 4:36 am

    On Thu, Jul 30, 2009 at 7:12 PM, Jim wrote:
    > I was wondering whether there is a list of special characters for the standard analyzer.
    >
    > By "special" I mean characters that the analyzer treats as break characters.
    > For example, if I have something like "foo=something", the analyzer apparently
    > treats it as two terms, "foo" and "something".

    Hi Jim,

    This is what I could find in the docs...

    StandardAnalyzer uses StandardTokenizer

    http://lucene.apache.org/java/2_4_0/api/org/apache/lucene/analysis/standard/StandardTokenizer.html
    * Splits words at punctuation characters, removing punctuation.
    However, a dot that's not followed by whitespace is considered part of
    a token.
    * Splits words at hyphens, unless there's a number in the token, in
    which case the whole token is interpreted as a product number and is
    not split.
    * Recognizes email addresses and internet hostnames as one token.

    Also, these are the stop words that will be removed:

    public static final String[] ENGLISH_STOP_WORDS = {
    "a", "an", "and", "are", "as", "at", "be", "but", "by",
    "for", "if", "in", "into", "is", "it",
    "no", "not", "of", "on", "or", "such",
    "that", "the", "their", "then", "there", "these",
    "they", "this", "to", "was", "will", "with"
    };
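
As a rough illustration of the stop-word step, here is a minimal sketch in plain Java with no Lucene dependency. The class `StopWordSketch` and the helper `removeStopWords` are hypothetical names made up for this example. It assumes tokens are lowercased before the stop-word check, and it is not the Lucene implementation.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

class StopWordSketch {
    // Same words as the ENGLISH_STOP_WORDS array quoted above.
    static final Set<String> STOP_WORDS = new HashSet<>(Arrays.asList(
        "a", "an", "and", "are", "as", "at", "be", "but", "by",
        "for", "if", "in", "into", "is", "it",
        "no", "not", "of", "on", "or", "such",
        "that", "the", "their", "then", "there", "these",
        "they", "this", "to", "was", "will", "with"));

    // Hypothetical helper: lowercases each token and drops the stop words.
    // Assumption: lowercasing happens before the stop-word check.
    static List<String> removeStopWords(List<String> tokens) {
        List<String> kept = new ArrayList<>();
        for (String token : tokens) {
            String lower = token.toLowerCase(Locale.ROOT);
            if (!STOP_WORDS.contains(lower)) {
                kept.add(lower);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<String> tokens = Arrays.asList("The", "value", "of", "foo", "is", "something");
        System.out.println(removeStopWords(tokens)); // prints [value, foo, something]
    }
}
```

A search for a phrase made up only of these words would therefore find nothing to match, which is worth keeping in mind when a query unexpectedly returns no hits.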

    Thanks,
    Phil

  • Ohaya at Jul 31, 2009 at 6:03 am

    Hi Phil,

    I guess that the obvious question is "Which characters are considered 'punctuation characters'?".

    In particular, does the analyzer consider "=" (equal) and ":" (colon) to be punctuation characters?

    Thanks,
    Jim

  • AHMET ARSLAN at Jul 31, 2009 at 7:27 am

    > I guess that the obvious question is "Which characters are
    > considered 'punctuation characters'?".

    Punctuation = ("_"|"-"|"/"|"."|",")

    > In particular, does the analyzer consider "=" (equal) and
    > ":" (colon) to be punctuation characters?

    ":" is a special character in QueryParser (if you are using it). If you want to search for it, you need to escape it first. At index time this character is dropped, like the punctuation characters: the string ahmet:arslan produces two tokens, ahmet and arslan. The tokenizer also breaks words at the "=" character at both query and index time.

    To understand the behavior of StandardTokenizer, look at the file StandardTokenizerImpl.jflex. It recognizes the following as single tokens: {ALPHANUM}, {APOSTROPHE}, {ACRONYM}, {COMPANY}, {EMAIL}, {HOST}, {NUM}, {CJ}, and {ACRONYM_DEP}, and ignores the rest. The file defines these token types in a notation similar to regular expressions. You can change the behavior of StandardTokenizer by editing this file and regenerating StandardTokenizerImpl.java from it. There is also another JFlex file, WikipediaTokenizerImpl.jflex; reading it shows how new token types can be added.

    Ahmet
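
To make the escaping remark concrete, here is a small sketch in plain Java. Lucene itself offers the static method `QueryParser.escape(String)` for this purpose. The class `QueryEscapeSketch` below is only a hypothetical stand-in so the example runs without Lucene, and its character list is an assumption taken from the classic query syntax, so check it against the version you use.

```java
class QueryEscapeSketch {
    // Characters the classic Lucene query syntax treats as operators.
    // Assumption: list taken from the query syntax documentation; '=' is not in it.
    static final String SPECIAL = "\\+-!():^[]\"{}~*?|&";

    // Hypothetical helper: prefixes every special character with a backslash.
    static String escape(String input) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < input.length(); i++) {
            char c = input.charAt(i);
            if (SPECIAL.indexOf(c) >= 0) {
                out.append('\\');
            }
            out.append(c);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(escape("ahmet:arslan"));  // prints ahmet\:arslan
        System.out.println(escape("foo=something")); // prints foo=something
    }
}
```

Escaping only stops the query parser from reading the colon as a field separator. As noted above, the analyzer still splits the escaped text into the tokens ahmet and arslan.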




  • Ohaya at Jul 31, 2009 at 3:00 pm
    Hi Ahmet,

    Thanks for the clarification and information! That was exactly what I was looking for.

    Jim


  • Simon Willnauer at Jul 31, 2009 at 3:13 pm

    ---- AHMET ARSLAN wrote:
    > Punctuation = ("_"|"-"|"/"|"."|",")

    Those punctuation characters are only used for floating-point numbers, IP addresses, etc.
    StandardTokenizer does not have an explicit set of punctuation characters. You can
    assume that it will split on, and drop, almost all punctuation that appears
    in the input string.

    Have a look at StandardTokenizerImpl.jflex; the grammar is quite easy to
    understand and gives you a better idea of what this tokenizer does.

    simon
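
The rule of thumb above (split on, and drop, almost all punctuation) can be approximated with a few lines of plain Java. This is a deliberately simplified sketch, not the StandardTokenizer grammar: the class `BreakCharSketch` and its `split` method are hypothetical, and they ignore the special cases for e-mail addresses, host names, acronyms and numbers.

```java
import java.util.ArrayList;
import java.util.List;

class BreakCharSketch {
    // Hypothetical stand-in for the tokenizer: keeps runs of letters and digits
    // and treats every other character as a break character.
    static List<String> split(String text) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (Character.isLetterOrDigit(c)) {
                current.append(c);
            } else if (current.length() > 0) {
                // A break character ends the token collected so far.
                tokens.add(current.toString());
                current.setLength(0);
            }
        }
        if (current.length() > 0) {
            tokens.add(current.toString());
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(split("foo=something")); // prints [foo, something]
        System.out.println(split("ahmet:arslan"));  // prints [ahmet, arslan]
    }
}
```

For exact behavior on a given input, running the text through the real analyzer and printing the resulting tokens is more reliable than any approximation like this one.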

Discussion Overview
group: java-user
categories: lucene
posted: Jul 31, '09 at 2:12a
active: Jul 31, '09 at 3:13p
posts: 6
users: 4
website: lucene.apache.org
