Hello:

I have a problem where I need to search for the term "C++".
If I use StandardAnalyzer, the "+" characters are removed and the
search is done on just the "c" character, which is not what is
intended.
Yet I need to use StandardAnalyzer for the other benefits it provides.

I think I need to write a specialized tokenizer (and accompanying
analyzer) that lets the "+" characters pass.
I would take the provided JFlex one, modify it, and add it to my project.

My question is:

Is there any simpler way to accomplish the same?


Best regards,
Alex Soto
lexsoto@gmail.com

-
Amicus Plato, sed magis amica veritas.

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


  • John Byrne at Jun 24, 2008 at 4:06 pm
    I don't think there is a simpler way. I think you will have to modify
    the tokenizer. Once you go beyond basic human-readable text, you always
    end up having to do that. I have modified the JavaCC version of
    StandardTokenizer to let symbols pass through, but I've never
    used the JFlex version; I don't know anything about JFlex, I'm afraid!

    A good strategy might be to make a new type of lexical token called
    "SYMBOL" and try to catch as many symbols as you can think of; then
    maybe create new token types which are ALPHANUM types that can have
    pre-fixed or post-fixed symbols.

    That way, you'll be able to catch things like "c++" in a TokenFilter,
    and you can choose to pass it through as a single token, or split it up
    into two tokens, or whatever you want.
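
    A minimal sketch of that strategy in plain Java, not the Lucene API.
    The regular expression stands in for the modified grammar, and the
    names SymbolAwareTokenizer, Token, tokenize, filter, the type label
    ALPHANUM_SYMBOL, and the symbol set "+" and "#" are all assumptions
    made for illustration. In a real project the token rules would live
    in the tokenizer grammar and the keep-or-split decision in a
    TokenFilter.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Plain-Java illustration only; none of these names are Lucene classes.
class SymbolAwareTokenizer {

    // A token plus the lexical type the (hypothetical) grammar gave it.
    static final class Token {
        final String text;
        final String type;

        Token(String text, String type) {
            this.text = text;
            this.type = type;
        }

        @Override
        public String toString() {
            return text + "/" + type;
        }
    }

    // Assumed symbol set; extend it with the symbols you need to catch.
    private static final String SYMBOL = "[+#]";

    // First alternative: letters/digits (group 2) with optional prefixed
    // (group 1) and postfixed (group 3) symbols. Second alternative
    // (group 4): symbols standing on their own.
    private static final Pattern TOKEN = Pattern.compile(
            "(" + SYMBOL + "*)([\\p{L}\\p{N}]+)(" + SYMBOL + "*)|(" + SYMBOL + "+)");

    // Tokenizer step: lower-case the input and assign a type to each match.
    static List<Token> tokenize(String input) {
        List<Token> tokens = new ArrayList<>();
        Matcher m = TOKEN.matcher(input.toLowerCase(Locale.ROOT));
        while (m.find()) {
            String type;
            if (m.group(4) != null) {
                type = "SYMBOL";
            } else if (m.group(1).isEmpty() && m.group(3).isEmpty()) {
                type = "ALPHANUM";
            } else {
                type = "ALPHANUM_SYMBOL";
            }
            tokens.add(new Token(m.group(), type));
        }
        return tokens;
    }

    // Filter step: pass symbol-carrying tokens through whole, or split them
    // into their symbol and alphanumeric parts.
    static List<Token> filter(List<Token> tokens, boolean split) {
        List<Token> out = new ArrayList<>();
        for (Token t : tokens) {
            Matcher m = TOKEN.matcher(t.text);
            if (!split || !t.type.equals("ALPHANUM_SYMBOL") || !m.matches()) {
                out.add(t);
                continue;
            }
            if (!m.group(1).isEmpty()) out.add(new Token(m.group(1), "SYMBOL"));
            out.add(new Token(m.group(2), "ALPHANUM"));
            if (!m.group(3).isEmpty()) out.add(new Token(m.group(3), "SYMBOL"));
        }
        return out;
    }
}
```

    With split set to false, "C++" comes out as the single token "c++";
    with split set to true it becomes "c" followed by "++".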

    Hope that helps.

    Regards,
    JB

  • N. Hira at Jun 24, 2008 at 4:14 pm
    This isn't ideal, but if you have a defined list of such terms, you
    may find it easier to filter these terms out into a separate field
    for indexing.
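
    A rough sketch of that idea in plain Java, leaving out the Lucene
    indexing calls. The class name SpecialTermExtractor, the method
    extract, and the term list are assumptions for illustration; the
    returned terms would be added to a separate field that is not run
    through StandardAnalyzer.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

// Plain-Java illustration only; field handling in Lucene itself is not shown.
class SpecialTermExtractor {

    // Assumed list of terms that the standard analysis would mangle.
    static final List<String> SPECIAL_TERMS = Arrays.asList("c++", "c#", ".net");

    // Returns the special terms present in the text, in list order.
    static List<String> extract(String text) {
        String lower = text.toLowerCase(Locale.ROOT);
        List<String> found = new ArrayList<>();
        for (String term : SPECIAL_TERMS) {
            if (containsAsWord(lower, term)) {
                found.add(term);
            }
        }
        return found;
    }

    // True if the term occurs with no letter or digit directly before or
    // after it, so "abc++" does not count as an occurrence of "c++".
    private static boolean containsAsWord(String text, String term) {
        int from = 0;
        while (true) {
            int i = text.indexOf(term, from);
            if (i < 0) {
                return false;
            }
            int end = i + term.length();
            boolean startOk = i == 0
                    || !Character.isLetterOrDigit(text.charAt(i - 1));
            boolean endOk = end == text.length()
                    || !Character.isLetterOrDigit(text.charAt(end));
            if (startOk && endOk) {
                return true;
            }
            from = i + 1;
        }
    }
}
```

    At query time the same list would be checked against the user's
    input, and any hit searched in the separate field instead of the
    analyzed one.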

    -h
    ----------------------------------------------------------------------
    Hira, N.R.
    Solutions Architect
    Cognocys, Inc.
    (773) 251-7453
  • Alex Soto at Jun 24, 2008 at 4:40 pm
    Thanks everyone. I appreciate the help.

    I think I will write my own tokenizer, because I do not have a
    predefined list of words with symbols.
    I will modify the grammar by defining a SYMBOL token, as John suggested,
    and redefining ALPHANUM to include it.

    Regards,
    Alex Soto



Discussion Overview
group: java-user @
categories: lucene
posted: Jun 24, '08 at 3:49p
active: Jun 24, '08 at 4:40p
posts: 4
users: 3
website: lucene.apache.org
