OK, I've followed your advice and commented out some lines in the NUM
section. It now works as expected, thanks a lot. I just tried it, and it
does what I wanted. It looks scary, but it isn't that bad.

Thanks!

Regards,
Michael


-----Original Message-----
From: Steven Rowe
Sent: Tuesday, May 29, 2007 19:54
To: java-user@lucene.apache.org
Subject: Re: Modifying StandardAnalyzer so that it also splits words
after punctuation characters that are not followed by whitespace


Hi Michael,

Michael Böckling wrote:
Hi folks!

The topic says it all: I want to modify the StandardAnalyzer so that it
also splits words after punctuation characters (.,: etc.) that are NOT
followed by a whitespace character, in addition to punctuation characters
that ARE followed by whitespace.

Of course I've looked at StandardTokenizer.jj, but I don't quite get it.
The recursive nature of the grammar bends my mind.

Can someone smarter than me help here?

Um, that probably disqualifies me, but anyway...

There are several regexes in StandardTokenizer.jj that generate tokens
containing punctuation. You should be able to selectively comment them
out to achieve what you want:

1. Acronyms:
   <ACRONYM: <ALPHA> "." (<ALPHA> ".")+ >
2. Company names:
   <COMPANY: <ALPHA> ("&"|"@") <ALPHA> >
3. Email addresses:
   <EMAIL: <ALPHANUM> (("."|"-"|"_") <ALPHANUM>)* "@" <ALPHANUM>
           (("."|"-") <ALPHANUM>)+ >
4. Hostnames:
   <HOST: <ALPHANUM> ("." <ALPHANUM>)+ >
5. The <NUM>, <P> and <HAS_DIGIT> regexes, for IP addresses, etc.:
   <NUM: (<ALPHANUM> <P> <HAS_DIGIT>
        | <HAS_DIGIT> <P> <ALPHANUM>
        | <ALPHANUM> (<P> <HAS_DIGIT> <P> <ALPHANUM>)+
        | <HAS_DIGIT> (<P> <ALPHANUM> <P> <HAS_DIGIT>)+
        | <ALPHANUM> <P> <HAS_DIGIT> (<P> <ALPHANUM> <P> <HAS_DIGIT>)+
        | <HAS_DIGIT> <P> <ALPHANUM> (<P> <HAS_DIGIT> <P> <ALPHANUM>)+
         )
   >
   <#P: ("_"|"-"|"/"|"."|",") >
   <#HAS_DIGIT: // at least one digit
       (<LETTER>|<DIGIT>)*
       <DIGIT>
       (<LETTER>|<DIGIT>)*
   >
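
To see why commenting out these rules changes the splitting, the first four can be approximated with plain java.util.regex patterns. This is only a rough sketch, not Lucene code: the class name TokenPatternSketch and the classify helper are made up for illustration, and <ALPHA> and <ALPHANUM> are assumed to mean runs of ASCII letters and ASCII letters/digits (the real grammar's <LETTER> covers more than ASCII). A chunk that matches one of the patterns is the kind of text the tokenizer would keep as a single token while that rule is active.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Pattern;

public class TokenPatternSketch {
    // Assumption: ASCII-only stand-ins for the grammar's <ALPHA> and <ALPHANUM>.
    private static final String ALPHA = "[A-Za-z]+";
    private static final String ALPHANUM = "[A-Za-z0-9]+";

    // Rules are tried in insertion order: ACRONYM, COMPANY, EMAIL, HOST.
    private static final Map<String, Pattern> RULES = new LinkedHashMap<>();
    static {
        // <ACRONYM: <ALPHA> "." (<ALPHA> ".")+ >
        RULES.put("ACRONYM", Pattern.compile(ALPHA + "\\.(?:" + ALPHA + "\\.)+"));
        // <COMPANY: <ALPHA> ("&"|"@") <ALPHA> >
        RULES.put("COMPANY", Pattern.compile(ALPHA + "[&@]" + ALPHA));
        // <EMAIL: <ALPHANUM> (("."|"-"|"_") <ALPHANUM>)* "@" <ALPHANUM> (("."|"-") <ALPHANUM>)+ >
        RULES.put("EMAIL", Pattern.compile(
                ALPHANUM + "(?:[._-]" + ALPHANUM + ")*@" + ALPHANUM + "(?:[.-]" + ALPHANUM + ")+"));
        // <HOST: <ALPHANUM> ("." <ALPHANUM>)+ >
        RULES.put("HOST", Pattern.compile(ALPHANUM + "(?:\\." + ALPHANUM + ")+"));
    }

    /** Returns the name of the first rule that matches the whole chunk, or "NONE". */
    public static String classify(String chunk) {
        for (Map.Entry<String, Pattern> rule : RULES.entrySet()) {
            if (rule.getValue().matcher(chunk).matches()) {
                return rule.getKey();
            }
        }
        return "NONE";
    }

    public static void main(String[] args) {
        String[] samples = {"I.B.M.", "AT&T", "joe.bloggs@example.com",
                            "lucene.apache.org", "foo.bar", "hello"};
        for (String sample : samples) {
            System.out.println(sample + " -> " + classify(sample));
        }
    }
}
```

In this approximation "foo.bar" is classified as HOST, which is why text with a period but no following whitespace stays in one piece until the HOST rule is commented out.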

Steve

--
Steve Rowe
Center for Natural Language Processing
http://www.cnlp.org/tech/lucene.asp

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org

Discussion Overview
group: java-user
categories: lucene
posted: May 30, '07 at 9:58a
active: May 30, '07 at 9:58a
posts: 1
users: 1
website: lucene.apache.org
