section. It now works as expected, thanks a lot. I just tried it and it does
what I wanted it to do now. It looks scary, but it isn't that bad.
Thanks!
Regards,
Michael
-----Original Message-----
From: Steven Rowe
Sent: Tuesday, 29 May 2007 19:54
To: java-user@lucene.apache.org
Subject: Re: Modifying StandardAnalyzer so that it also splits words
after punctuation characters that are not followed by whitespace
Hi Michael,
Michael Böckling wrote:
Hi folks!
The topic says it all: I want to modify the StandardAnalyzer so that it also
splits words after punctuation characters (.,: etc.) that are NOT followed by
a whitespace character, in addition to punctuation characters that ARE
followed by whitespace.
Of course I've looked at StandardTokenizer.jj, but I don't quite get it. The
recursive nature of the grammar bends my mind.
Can someone smarter than me help here?
Um, that probably disqualifies me, but anyway...
There are several regexes in StandardTokenizer.jj that generate tokens
containing punctuation. You should be able to selectively comment them
out to achieve what you want:
1. Acronyms:
<ACRONYM: <ALPHA> "." (<ALPHA> ".")+ >
2. Company names:
<COMPANY: <ALPHA> ("&"|"@") <ALPHA> >
3. Email addresses:
<EMAIL: <ALPHANUM> (("."|"-"|"_") <ALPHANUM>)* "@" <ALPHANUM>
(("."|"-") <ALPHANUM>)+ >
4. Hostnames:
<HOST: <ALPHANUM> ("." <ALPHANUM>)+ >
5. The <NUM>, <P> and <HAS_DIGIT> regexes, for IP addresses, etc.:
<NUM: (<ALPHANUM> <P> <HAS_DIGIT>
| <HAS_DIGIT> <P> <ALPHANUM>
| <ALPHANUM> (<P> <HAS_DIGIT> <P> <ALPHANUM>)+
| <HAS_DIGIT> (<P> <ALPHANUM> <P> <HAS_DIGIT>)+
| <ALPHANUM> <P> <HAS_DIGIT> (<P> <ALPHANUM> <P> <HAS_DIGIT>)+
| <HAS_DIGIT> <P> <ALPHANUM> (<P> <HAS_DIGIT> <P> <ALPHANUM>)+
)
>
<#P: ("_"|"-"|"/"|"."|",") >
<#HAS_DIGIT: // at least one digit
(<LETTER>|<DIGIT>)*
<DIGIT>
(<LETTER>|<DIGIT>)*
>
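To make the effect of commenting out such rules a bit more concrete, here is a rough sketch in plain Java. It does not use Lucene or JavaCC at all: the class name PunctuationSplitSketch, the tokenize method and both patterns are made up for illustration, and the java.util.regex patterns only approximate the HOST and ALPHANUM rules listed above. With the host-like pattern active, dotted strings survive as one token; without it, only plain letter/digit runs are left, so text is split at every punctuation character whether or not whitespace follows.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class PunctuationSplitSketch {
    // Assumption: a plain run of letters/digits stands in for <ALPHANUM>.
    static final String ALPHANUM = "[\\p{L}\\p{N}]+";
    // Assumption: rough stand-in for <HOST: <ALPHANUM> ("." <ALPHANUM>)+ >.
    static final String HOST_LIKE = ALPHANUM + "(?:\\." + ALPHANUM + ")+";

    // keepHostLike = true mimics the rule being active,
    // keepHostLike = false mimics the rule being commented out.
    static List<String> tokenize(String text, boolean keepHostLike) {
        Pattern pattern = Pattern.compile(
                keepHostLike ? HOST_LIKE + "|" + ALPHANUM : ALPHANUM);
        List<String> tokens = new ArrayList<>();
        Matcher matcher = pattern.matcher(text);
        while (matcher.find()) {
            tokens.add(matcher.group());
        }
        return tokens;
    }

    public static void main(String[] args) {
        String text = "see lucene.apache.org,then index.Done";
        // Host-like rule active: dotted strings stay together.
        System.out.println(tokenize(text, true));
        // Host-like rule disabled: every punctuation character splits.
        System.out.println(tokenize(text, false));
    }
}
```

The real grammar has more rules than this (acronyms, company names, email addresses, numbers), so each of those would need the same treatment before all punctuation acts as a separator.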
Steve
--
Steve Rowe
Center for Natural Language Processing
http://www.cnlp.org/tech/lucene.asp
---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org