Hello

The Lucene 2.0.0 StandardAnalyzer treats "_" (underscore) as a token
separator. Is there a way to make StandardAnalyzer not split on "_" or
on other characters of my choosing?

I'd like to keep all of StandardAnalyzer's features but modify it
slightly for my needs. How do I control which characters it splits on?

Ex: Test_test1_test2 is my data
StandardAnalyzer: Test test1 test2 my data
I'd like to have: Test_test1_test2 my data


Please help.


Thanks,


Anh Ngo


-----Original Message-----
From: Chris Hostetter
Sent: Wednesday, July 19, 2006 12:25 PM
To: java-user@lucene.apache.org
Subject: Re: BooleanQuery question


: If I search with boolQuery, Lucene doesn't find anything.
: If I modify the query by hand from "+(-(FILE:abstract.htm))
: +(PATH:/bssrs)" to "-(FILE:abstract.htm) +(PATH:/bssrs)", Lucene finds
: the correct list of documents.
:
: Does somebody know why?

You can't have a boolean query containing only MUST_NOT clauses, which
is what (-(FILE:abstract.htm)) is. It matches no documents, so the
mandatory qualification on it causes the query to fail for all docs.
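
The reasoning above can be modeled with plain sets. The sketch below is not Lucene code: `MustNotOnlyDemo`, `Clause`, and `evaluate` are hypothetical names, and the document ids and match sets are made up for illustration. It assumes the simplified rule that MUST_NOT clauses only subtract from documents already selected by MUST or SHOULD clauses, so a query made only of MUST_NOT clauses selects nothing.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

public class MustNotOnlyDemo {
    enum Occur { MUST, SHOULD, MUST_NOT }

    /** A clause reduced to its occurrence flag and the ids of the docs it matches. */
    static final class Clause {
        final Occur occur;
        final Set<Integer> docs;

        Clause(Occur occur, Set<Integer> docs) {
            this.occur = occur;
            this.docs = docs;
        }
    }

    static Set<Integer> ids(Integer... values) {
        return new TreeSet<Integer>(Arrays.asList(values));
    }

    /** Simplified boolean evaluation: positive clauses select, MUST_NOT subtracts. */
    static Set<Integer> evaluate(List<Clause> clauses) {
        boolean hasMust = false;
        for (Clause c : clauses) {
            if (c.occur == Occur.MUST) hasMust = true;
        }
        Set<Integer> selected = null; // stays null until a positive clause is seen
        for (Clause c : clauses) {
            if (c.occur == Occur.MUST) {
                if (selected == null) selected = new TreeSet<Integer>(c.docs);
                else selected.retainAll(c.docs);
            } else if (c.occur == Occur.SHOULD && !hasMust) {
                if (selected == null) selected = new TreeSet<Integer>();
                selected.addAll(c.docs);
            }
        }
        if (selected == null) return new TreeSet<Integer>(); // only MUST_NOT clauses
        for (Clause c : clauses) {
            if (c.occur == Occur.MUST_NOT) selected.removeAll(c.docs);
        }
        return selected;
    }

    // Made-up index: doc 1 is abstract.htm under /bssrs, doc 2 is another file under /bssrs.
    static final Set<Integer> FILE_ABSTRACT = ids(1);
    static final Set<Integer> PATH_BSSRS = ids(1, 2);

    /** Models +(-(FILE:abstract.htm)) +(PATH:/bssrs). */
    static Set<Integer> nestedQuery() {
        Set<Integer> inner = evaluate(Arrays.asList(new Clause(Occur.MUST_NOT, FILE_ABSTRACT)));
        return evaluate(Arrays.asList(
                new Clause(Occur.MUST, inner),
                new Clause(Occur.MUST, PATH_BSSRS)));
    }

    /** Models -(FILE:abstract.htm) +(PATH:/bssrs). */
    static Set<Integer> flatQuery() {
        return evaluate(Arrays.asList(
                new Clause(Occur.MUST_NOT, FILE_ABSTRACT),
                new Clause(Occur.MUST, PATH_BSSRS)));
    }

    public static void main(String[] args) {
        System.out.println("nested: " + nestedQuery());
        System.out.println("flat:   " + flatQuery());
    }
}
```

In this model the nested form yields an empty set because the inner sub-query selects nothing before it is made mandatory, while the flat form yields doc 2.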


:
: Thanks in advance,
:
: Nicolas
:
: ---------------------------------------------------------------------
: To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
: For additional commands, e-mail: java-user-help@lucene.apache.org
:



-Hoss


  • Daniel Naber at Jul 21, 2006 at 6:47 pm

    On Friday, 21 July 2006 16:16, Ngo, Anh (ISS Southfield) wrote:

    > The Lucene 2.0.0 StandardAnalyzer treats "_" (underscore) as a token
    > separator. Is there a way to make StandardAnalyzer not split on "_"
    > or on other characters of my choosing?

    You need to add "_" to the #LETTER definition in StandardTokenizer.jj, then
    rebuild StandardTokenizer.java using the appropriate ant task.

    Regards
    Daniel

    --
    http://www.danielnaber.de
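
To see why widening the letter class changes the output, here is a minimal sketch in plain Java. It is not the StandardTokenizer grammar: `LetterClassDemo` and `tokenize` are hypothetical names, and the model only splits on characters that are neither letters, digits, nor explicitly listed extras. It does no lowercasing or stop-word removal, so "is" stays in the output.

```java
import java.util.ArrayList;
import java.util.List;

public class LetterClassDemo {
    /**
     * Splits text into runs of "token characters": letters, digits,
     * and any character listed in extraLetters.
     */
    static List<String> tokenize(String text, String extraLetters) {
        List<String> tokens = new ArrayList<String>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            char ch = text.charAt(i);
            boolean inToken = Character.isLetterOrDigit(ch) || extraLetters.indexOf(ch) >= 0;
            if (inToken) {
                current.append(ch);
            } else if (current.length() > 0) {
                // A separator ends the current token.
                tokens.add(current.toString());
                current.setLength(0);
            }
        }
        if (current.length() > 0) tokens.add(current.toString());
        return tokens;
    }

    public static void main(String[] args) {
        String text = "Test_test1_test2 is my data";
        System.out.println(tokenize(text, ""));  // underscore acts as a separator
        System.out.println(tokenize(text, "_")); // underscore is kept inside tokens
    }
}
```

With no extra letters the sample splits into Test, test1, test2, is, my, data. With "_" treated as a letter, Test_test1_test2 survives as one token.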

  • Ngo, Anh (ISS Southfield) at Jul 21, 2006 at 7:44 pm
    What is the #LETTER definition in StandardTokenizer.jj?


    I saw:
    <#P: ("_"|"-"|"/"|"."|",") >
    <#HAS_DIGIT: // at least one digit
    (<LETTER>|<DIGIT>)*
    <DIGIT>
    (<LETTER>|<DIGIT>)*
    >


    Should I remove "_" and recompile the source code?

    Sincerely,


    Anh Ngo

  • Mark Miller at Jul 21, 2006 at 7:51 pm
    I do not believe so. If you look above, you will see that #P is only used
    when looking for a NUM: a host IP, a phone number, etc. You would be removing
    the ability to recognize a "_" while rooting those tokens out. It will
    still be parsed when tokenizing an EMAIL as well. I don't think this is the
    behavior you want.

    - Mark
  • Mark Miller at Jul 21, 2006 at 8:05 pm
    I take it back. That is probably exactly what you want. Watch out if you're
    not compiling all of Lucene: you need to avoid a ParserException using ant
    if you try to extract just the StandardAnalyzer package (the recommended
    approach).

  • Ngo, Anh (ISS Southfield) at Jul 21, 2006 at 7:59 pm
    Hello Mark,


    Please show me how to add "-" to #LETTER definition


    Thanks,


    Anh Ngo

  • Mark Miller at Jul 21, 2006 at 8:09 pm
    < #LETTER: // unicode letters
      [
        "\u0041"-"\u005a",
        "\u0061"-"\u007a",
        "\u00c0"-"\u00d6",
        "\u00d8"-"\u00f6",
        "\u00f8"-"\u00ff",
        "\u0100"-"\u1fff"
      ]
    >

    becomes

    < #LETTER: // unicode letters
      [
        "\u0041"-"\u005a",
        "\u0061"-"\u007a",
        "\u00c0"-"\u00d6",
        "\u00d8"-"\u00f6",
        "\u00f8"-"\u00ff",
        "\u0100"-"\u1fff",
        "\u002d"
      ]
    >
  • Doron Cohen at Jul 21, 2006 at 8:31 pm
    "\u002d" would add "-".
    The original request was for "_", which is "\u005f".
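
One way to avoid mixing up the two escapes is to print them from the characters themselves. This is a small standalone sketch; `EscapeCheck` and `escape` are hypothetical names, not part of Lucene or JavaCC.

```java
public class EscapeCheck {
    /** Formats a character as a four-digit lowercase Unicode escape. */
    static String escape(char c) {
        return String.format("\\u%04x", (int) c);
    }

    public static void main(String[] args) {
        // Print the escape to use in the grammar for each candidate character.
        System.out.println("underscore: " + escape('_'));
        System.out.println("hyphen:     " + escape('-'));
    }
}
```

It reports 005f for the underscore and 002d for the hyphen, which matches the correction above.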
  • Ngo, Anh (ISS Southfield) at Jul 21, 2006 at 9:22 pm
    I tried it and recompiled the whole package, but it did not work.

    My #LETTER is:
    < #LETTER: // unicode letters
    [
    "\u0041"-"\u005a",
    "\u005f",
    "\u0061"-"\u007a",
    "\u00c0"-"\u00d6",
    "\u00d8"-"\u00f6",
    "\u00f8"-"\u00ff",
    "\u0100"-"\u1fff"
    ]
    >

    Or:
    < #LETTER: // unicode letters
    [
    "\u0041"-"\u005a",
    "\u0061"-"\u007a",
    "\u00c0"-"\u00d6",
    "\u00d8"-"\u00f6",
    "\u00f8"-"\u00ff",
    "\u0100"-"\u1fff",
    "\u005f"
    ]
    >

    Please help.



    Anh Ngo

  • Mark Miller at Jul 21, 2006 at 9:33 pm

    What failed? Error messages? You have JavaCC? Any info? Psychic power
    don't fail me now...


    -mark

  • Ngo, Anh (ISS Southfield) at Jul 21, 2006 at 9:42 pm
    It works now.

    Thank you very much.

    I forgot to run JavaCC on StandardTokenizer.jj.


    Sincerely,



    Anh Ngo
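
The fix in this thread came down to regenerating StandardTokenizer.java after editing the grammar. As a rough guard against repeating that mistake, the sketch below compares file timestamps. `StaleParserCheck` and `isStale` are hypothetical names, and the default file paths are assumptions about where the two files sit in a checkout.

```java
import java.io.File;

public class StaleParserCheck {
    /** Returns true when the generated file is missing or older than the grammar. */
    static boolean isStale(File grammar, File generated) {
        return !generated.exists() || generated.lastModified() < grammar.lastModified();
    }

    public static void main(String[] args) {
        // Paths are assumptions; pass the real locations as arguments.
        File grammar = new File(args.length > 0 ? args[0] : "StandardTokenizer.jj");
        File generated = new File(args.length > 1 ? args[1] : "StandardTokenizer.java");
        if (isStale(grammar, generated)) {
            System.out.println("Generated tokenizer is missing or older than the grammar; rerun JavaCC.");
        } else {
            System.out.println("Generated tokenizer is at least as new as the grammar.");
        }
    }
}
```

A timestamp comparison is only a heuristic, since a fresh checkout or a copied file can reset modification times.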




Discussion Overview
group: java-user
category: lucene
posted: Jul 21, '06 at 2:16p
active: Jul 21, '06 at 9:42p
posts: 11
users: 4
website: lucene.apache.org
