On Sun, Feb 6, 2011 at 3:28 PM, Georger Araujo wrote:
Hi,
I started using Lucene a few weeks ago, and I must say I'm amazed. Hats off
to the developers and the community!
I'd like to write a custom analyzer whose only difference from
org.apache.lucene.analysis.br.BrazilianAnalyzer is that I want it to discard
numeric tokens from the input. I've looked at the code and also at the
discussion in [1], but I'm not sure what the simplest/cleanest way to go
about it is.
What do you think?
Hi, in general the supplied analyzers are basically very general-purpose
examples.
So I would build your own analyzer, using a tokenizer that discards
numbers (like LowerCaseTokenizer) instead of StandardTokenizer: something
like LowerCaseTokenizer + BrazilianStemFilter + the Brazilian stopwords in
a StopFilter.
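Untested sketch of what that chain could look like on the 3.x API (the
class name and the tiny stop set are just placeholders; in practice you
would pass in the full Brazilian stopword list):

import java.io.Reader;
import java.util.Set;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.LowerCaseTokenizer;
import org.apache.lucene.analysis.StopFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.br.BrazilianStemFilter;

public class NumberFreeBrazilianAnalyzer extends Analyzer {
  // Placeholder stop set -- substitute the full Brazilian stopword list.
  private static final Set<?> STOPWORDS =
      StopFilter.makeStopSet(new String[] { "de", "a", "o", "que", "e" });

  @Override
  public TokenStream tokenStream(String fieldName, Reader reader) {
    // LowerCaseTokenizer only keeps runs of letters, so digits never
    // become tokens in the first place.
    TokenStream stream = new LowerCaseTokenizer(reader);
    // true = preserve position increments across removed stopwords.
    stream = new StopFilter(true, stream, STOPWORDS);
    return new BrazilianStemFilter(stream);
  }
}

Since LowerCaseTokenizer splits on anything that is not a letter, numbers
are simply never emitted, and no extra filtering step is needed.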
I investigated this issue further and found out that StandardTokenizer is
actually desirable for my needs - I need to index emails, acronyms, etc. So
I'll use org.apache.lucene.analysis.StopFilter as a starting point to try
to write a custom TokenFilter that discards numbers, then just extend
BrazilianAnalyzer and use this custom TokenFilter as the final filter in the
chain. I believe the end result will be simpler and cleaner this way.
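For what it's worth, a rough, untested sketch of such a filter on the 3.x
TokenStream API (the name DiscardNumbersFilter is made up):

import java.io.IOException;

import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.TermAttribute;

public final class DiscardNumbersFilter extends TokenFilter {
  private final TermAttribute termAtt = addAttribute(TermAttribute.class);

  public DiscardNumbersFilter(TokenStream input) {
    super(input);
  }

  @Override
  public boolean incrementToken() throws IOException {
    while (input.incrementToken()) {
      if (!isAllDigits(termAtt.termBuffer(), termAtt.termLength())) {
        return true; // emit the first non-numeric token
      }
      // numeric token: skip it and pull the next one
    }
    return false; // end of stream
  }

  private static boolean isAllDigits(char[] buffer, int length) {
    if (length == 0) return false;
    for (int i = 0; i < length; i++) {
      if (!Character.isDigit(buffer[i])) return false;
    }
    return true;
  }
}

Two caveats: unlike StopFilter, this sketch does not add the skipped
tokens' position increments to the next token, which can matter for phrase
queries; and depending on your Lucene version BrazilianAnalyzer may be
declared final, in which case you would replicate its chain in your own
Analyzer and append the filter there instead of subclassing.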
Best regards,
Georger