Hi,
I started using Lucene a few weeks ago, and I must say I'm amazed. Hats off
to the developers and the community!
I'd like to write a custom analyzer whose only difference from
org.apache.lucene.analysis.br.BrazilianAnalyzer is that I want it to discard
numeric tokens from the input. I've looked at the code and also at the
discussion in [1], but I'm lost about what the simplest/cleanest way to
go is.
What do you think?

[1]
http://mail-archives.apache.org/mod_mbox/lucene-java-user/200809.mbox/%[email protected]%3E

Best regards,

Georger


  • Robert Muir at Feb 7, 2011 at 4:45 pm

    On Sun, Feb 6, 2011 at 3:28 PM, Georger Araujo wrote:
    > [...] I want it to discard numeric tokens from the input. [...]

    Hi,
    In general, the supplied analyzers are basically very general-purpose
    examples.

    So I would write your own analyzer, built like BrazilianAnalyzer except
    that it uses a tokenizer that discards numbers (like LowerCaseTokenizer)
    instead of StandardTokenizer: something like LowerCaseTokenizer +
    BrazilianStemFilter + a StopFilter with the Brazilian stopwords.
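
    To see why a letter-based tokenizer drops numbers, here is a minimal
    plain-Java sketch of the behaviour (splitting at every non-letter and
    lowercasing). It is an illustration only and does not use Lucene; the
    class name LetterTokenizingSketch and the method tokenize are made up
    for this example.

```java
import java.util.ArrayList;
import java.util.List;

// Plain-Java illustration only; this is not Lucene code.
class LetterTokenizingSketch {

    // Splits text at every non-letter character and lowercases the pieces.
    // Digits are non-letters, so purely numeric runs never become tokens.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (Character.isLetter(c)) {
                current.append(Character.toLowerCase(c));
            } else if (current.length() > 0) {
                // A non-letter ends the token collected so far.
                tokens.add(current.toString());
                current.setLength(0);
            }
        }
        if (current.length() > 0) {
            tokens.add(current.toString());
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("Pedido 12345 enviado em 2011"));
        System.out.println(tokenize("Contato: joao@example.com"));
    }
}
```

    The first call keeps only "pedido", "enviado" and "em". The second one
    breaks the address into "joao", "example" and "com", so this approach
    also splits e-mail addresses and anything else containing non-letters.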

  • Georger Araujo at Feb 8, 2011 at 3:58 pm
    2011/2/7 Robert Muir wrote:
    > [...] a tokenizer that discards numbers (like LowerCaseTokenizer)
    > instead of StandardTokenizer [...]

    Hi,
    I investigated this further and found that StandardTokenizer is
    actually desirable for my needs: I need to index emails, acronyms, etc. So
    I'll use the class org.apache.lucene.analysis.StopFilter as a starting point
    for a custom TokenFilter that discards numbers, then extend
    BrazilianAnalyzer and use this custom TokenFilter as the final filter in the
    chain. I believe the end result will be simpler and cleaner this way.
    Best regards,

    Georger
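
    The core of such a filter is the decision "is this token numeric?". Below
    is a minimal plain-Java sketch of that check, written without Lucene so
    it can be run on its own. The definition of "numeric" (digits plus the
    separators '.', ',' and '-') is an assumption, and the names
    NumericTokenFilterSketch, isNumeric and dropNumeric are hypothetical. In
    a real TokenFilter the same test would presumably run inside
    incrementToken(), skipping every token for which it returns true, much
    as StopFilter skips stopwords.

```java
import java.util.ArrayList;
import java.util.List;

// Plain-Java illustration only; this is not Lucene code.
class NumericTokenFilterSketch {

    // Assumed definition: a token is numeric when it contains at least one
    // digit and nothing other than digits and the separators '.', ',', '-'.
    static boolean isNumeric(String term) {
        boolean sawDigit = false;
        for (int i = 0; i < term.length(); i++) {
            char c = term.charAt(i);
            if (Character.isDigit(c)) {
                sawDigit = true;
            } else if (c != '.' && c != ',' && c != '-') {
                return false; // any other character makes it non-numeric
            }
        }
        return sawDigit;
    }

    // Returns a new list with the numeric tokens removed, preserving order.
    static List<String> dropNumeric(List<String> tokens) {
        List<String> kept = new ArrayList<>();
        for (String token : tokens) {
            if (!isNumeric(token)) {
                kept.add(token);
            }
        }
        return kept;
    }
}
```

    With this definition "2011", "3.14" and "1,000" are discarded, while
    mixed tokens such as "b2b" or "joao@example.com" are kept. Note that
    simply dropping tokens leaves no gap in the token positions, which may
    matter for phrase queries.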

Discussion Overview
group: java-user @
categories: lucene
posted: Feb 6, '11 at 8:29p
active: Feb 8, '11 at 3:58p
posts: 3
users: 2
website: lucene.apache.org
