Hey all,

I'm somewhat new to Lucene; I used it a while ago for a parser we wrote to
tokenize documents into word grams.

The approach I took was simple:

1. Extended the Lucene Analyzer.
2. In the tokenStream method, wrapped a StandardTokenizer in a
ShingleMatrixFilter, passing in the shingle min/max sizes and the spacer
character (roughly as in the sketch below).
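
A minimal sketch of that setup, assuming the Lucene 3.x-era APIs (the class
and field names are placeholders, not our actual code):

    import java.io.Reader;

    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.shingle.ShingleMatrixFilter;
    import org.apache.lucene.analysis.standard.StandardTokenizer;
    import org.apache.lucene.util.Version;

    // Emits word grams (shingles) of min..max words from the standard
    // tokenizer's output, joined by a spacer character.
    public class WordGramAnalyzer extends Analyzer {
        private final int minShingleSize;
        private final int maxShingleSize;

        public WordGramAnalyzer(int minShingleSize, int maxShingleSize) {
            this.minShingleSize = minShingleSize;
            this.maxShingleSize = maxShingleSize;
        }

        @Override
        public TokenStream tokenStream(String fieldName, Reader reader) {
            TokenStream words = new StandardTokenizer(Version.LUCENE_30, reader);
            // The last argument is the spacer ("splitter") character used
            // to join the words of each shingle.
            return new ShingleMatrixFilter(words, minShingleSize, maxShingleSize, '_');
        }
    }

With min and max both set to 2, "seoul is large" should come out as the
tokens "seoul_is" and "is_large".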

This worked pretty well for us. Now we would like to tokenize Hangul/Korean
text into word grams.

I'm curious whether others have done something similar and would share their
experience. Any pointers on getting started with this would be great.

Thanks.


  • Simon Willnauer at Feb 19, 2011 at 11:24 pm
    Hey,

    I am not an expert on this, but I think you should look into
    CJKAnalyzer / CJKTokenizer.
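
    If CJKTokenizer treats Hangul like the other CJK ranges, you should get
    overlapping character bigrams out of it directly. A quick sketch of how
    to eyeball its output (assuming the contrib analyzers jar from Lucene
    3.x; the class name is just a placeholder):

        import java.io.StringReader;

        import org.apache.lucene.analysis.cjk.CJKTokenizer;
        import org.apache.lucene.analysis.tokenattributes.TermAttribute;

        // Prints the tokens CJKTokenizer emits for a Hangul string.
        public class CJKDemo {
            public static void main(String[] args) throws Exception {
                CJKTokenizer tokenizer = new CJKTokenizer(new StringReader("안녕하세요"));
                TermAttribute term = tokenizer.addAttribute(TermAttribute.class);
                tokenizer.reset();
                while (tokenizer.incrementToken()) {
                    // expected: overlapping bigrams 안녕, 녕하, 하세, 세요
                    System.out.println(term.term());
                }
                tokenizer.close();
            }
        }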

    simon
