Hi folks!

The topic says it all: I want to modify the StandardAnalyzer so that it also
splits words after punctuation characters (.,: etc.) that are NOT followed
by a whitespace character, in addition to punctuation characters that ARE
followed by whitespace.

Of course I've looked at StandardTokenizer.jj, but I don't quite get it. The
recursive nature of the grammar bends my mind.

Can someone smarter than me help here? I'd be most thankful!
Regards,


Michael


--
Michael Böckling
Java Engineer
dmc digital media center GmbH
Rommelstraße 11
70376 Stuttgart (Germany)
Phone: +49 711 601747-0
Fax: +49 711 601747-141
E-Mail: [email protected]
Internet: www.dmc.de

Commercial register: AG Stuttgart HRB 18974
Managing directors: Andreas Magg, Daniel Rebhorn, Andreas Schwend

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

  • Erick Erickson at May 29, 2007 at 5:35 pm
    Well, one possibility is to do something simpler. Rather than
    modifying StandardAnalyzer, modify the input stream. That is,
    substitute spaces for punctuation NOT followed by whitespace
    and then just let the analyzer handle the result.
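A minimal sketch of that substitution in plain Java, assuming the text is available as a String before it reaches the analyzer. The class and method names are made up for illustration and are not part of Lucene. Note that \p{Punct} covers ASCII punctuation only and includes characters such as the apostrophe and underscore, so narrow the character class if that is too aggressive.

```java
import java.util.regex.Pattern;

/** Hypothetical helper for illustration; not part of Lucene. */
final class PunctuationSpacer {
    // One ASCII punctuation character that is immediately followed by a
    // non-whitespace character. The lookahead (?=\S) does not consume the
    // following character, so runs of punctuation are handled one by one.
    private static final Pattern GLUED_PUNCT = Pattern.compile("\\p{Punct}(?=\\S)");

    /** Replaces each punctuation character NOT followed by whitespace with a space. */
    static String spacePunctuation(String text) {
        return GLUED_PUNCT.matcher(text).replaceAll(" ");
    }
}
```

Punctuation that is already followed by whitespace, or that ends the input, is left alone, so the analyzer treats it exactly as before.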

    For that matter, if you're going to alter the input stream
    before giving it to the analyzer, you can then use pretty
    simple analyzers unless you need some of the
    other characteristics of StandardAnalyzer....
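Since analyzers consume a java.io.Reader, the alteration could also be done on the fly by wrapping the Reader. What follows is only a sketch under assumptions: the class name, the rewrite helper and the punctuation set ".,:;!?" are all hypothetical choices, and nothing here is a Lucene API. It uses a one-character pushback to look at the character after each punctuation mark.

```java
import java.io.FilterReader;
import java.io.IOException;
import java.io.PushbackReader;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;

/** Hypothetical Reader wrapper for illustration; not part of Lucene. */
final class PunctuationSpacingReader extends FilterReader {
    PunctuationSpacingReader(Reader in) {
        super(new PushbackReader(in, 1)); // one character of lookahead
    }

    @Override
    public int read() throws IOException {
        PushbackReader pb = (PushbackReader) in;
        int c = pb.read();
        if (c == -1 || !isPunct(c)) return c;
        int next = pb.read();
        if (next == -1) return c;          // punctuation at end of input: keep it
        pb.unread(next);                   // put the lookahead character back
        return Character.isWhitespace(next) ? c : ' ';
    }

    @Override
    public int read(char[] buf, int off, int len) throws IOException {
        if (len == 0) return 0;
        int n = 0;
        while (n < len) {                  // simple loop; reads until buf is full or EOF
            int c = read();
            if (c == -1) break;
            buf[off + n++] = (char) c;
        }
        return n == 0 ? -1 : n;
    }

    // Assumed punctuation set; adjust to taste.
    private static boolean isPunct(int c) {
        return ".,:;!?".indexOf(c) >= 0;
    }

    /** Convenience helper: runs a whole String through the reader. */
    static String rewrite(String text) {
        StringBuilder out = new StringBuilder();
        try (Reader r = new PunctuationSpacingReader(new StringReader(text))) {
            for (int c = r.read(); c != -1; c = r.read()) out.append((char) c);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toString();
    }
}
```

The wrapped Reader would then be handed to whichever analyzer is in use in place of the original one. skip() and mark() are not adapted in this sketch.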

    Or not....

    Erick
  • Steven Rowe at May 29, 2007 at 5:55 pm
    Hi Michael,

    Michael Böckling wrote:
    > Can someone smarter than me help here?

    Um, that probably disqualifies me, but anyway...

    There are several regexes in StandardTokenizer.jj that generate tokens
    containing punctuation. You should be able to selectively comment them
    out to achieve what you want:

    1. Acronyms:
       <ACRONYM: <ALPHA> "." (<ALPHA> ".")+ >

    2. Company names:
       <COMPANY: <ALPHA> ("&"|"@") <ALPHA> >

    3. Email addresses:
       <EMAIL: <ALPHANUM> (("."|"-"|"_") <ALPHANUM>)* "@" <ALPHANUM>
               (("."|"-") <ALPHANUM>)+ >

    4. Hostnames:
       <HOST: <ALPHANUM> ("." <ALPHANUM>)+ >

    5. The <NUM>, <P> and <HAS_DIGIT> regexes, for IP addresses, etc.:
       <NUM: (<ALPHANUM> <P> <HAS_DIGIT>
            | <HAS_DIGIT> <P> <ALPHANUM>
            | <ALPHANUM> (<P> <HAS_DIGIT> <P> <ALPHANUM>)+
            | <HAS_DIGIT> (<P> <ALPHANUM> <P> <HAS_DIGIT>)+
            | <ALPHANUM> <P> <HAS_DIGIT> (<P> <ALPHANUM> <P> <HAS_DIGIT>)+
            | <HAS_DIGIT> <P> <ALPHANUM> (<P> <HAS_DIGIT> <P> <ALPHANUM>)+
             )
       >
       <#P: ("_"|"-"|"/"|"."|",") >
       <#HAS_DIGIT: // at least one digit
           (<LETTER>|<DIGIT>)*
           <DIGIT>
           (<LETTER>|<DIGIT>)*
       >
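To get a feel for why input such as foo.bar currently survives as a single token, here is a rough sketch that approximates three of the rules above with java.util.regex. It rests on assumptions: <ALPHA> is taken to be a run of ASCII letters and <ALPHANUM> a run of ASCII letters or digits, and the class and method names are hypothetical. The real tokenizer is generated by JavaCC and chooses among all of its rules, which this sketch does not model.

```java
import java.util.regex.Pattern;

/** Hypothetical approximation for illustration; not the generated tokenizer. */
final class PunctuatedTokenRules {
    // Assumption: <ALPHA> ~ [A-Za-z]+ and <ALPHANUM> ~ [A-Za-z0-9]+ (ASCII only).
    static final Pattern ACRONYM = Pattern.compile("[A-Za-z]+\\.(?:[A-Za-z]+\\.)+");
    static final Pattern COMPANY = Pattern.compile("[A-Za-z]+[&@][A-Za-z]+");
    static final Pattern HOST    = Pattern.compile("[A-Za-z0-9]+(?:\\.[A-Za-z0-9]+)+");

    /** True if the whole string matches one of the approximated rules. */
    static boolean keptWhole(String s) {
        return ACRONYM.matcher(s).matches()
            || COMPANY.matcher(s).matches()
            || HOST.matcher(s).matches();
    }
}
```

With these approximations, keptWhole("foo.bar") is true because the string fits the hostname pattern, which is the behaviour that commenting out the HOST rule is meant to remove.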


    Steve

    --
    Steve Rowe
    Center for Natural Language Processing
    http://www.cnlp.org/tech/lucene.asp

Discussion Overview
group: java-user @
categories: lucene
posted: May 29, '07 at 3:43p
active: May 29, '07 at 5:55p
posts: 3
users: 3
website: lucene.apache.org
