I'm somewhat new to Lucene; my only real experience is a parser we wrote a
while back to tokenize documents into word grams.
The approach I took was simple:
1. Extended the Lucene Analyzer class.
2. In the tokenStream method, wrapped a StandardTokenizer in a
ShingleMatrixFilter, passing the shingle min/max sizes and the spacer
character (a sketch of this is below).
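For reference, here is a minimal sketch of that analyzer. I'm assuming
Lucene 3.0.x, where ShingleMatrixFilter still ships in the contrib
analyzers; the class name and the 2/3 shingle sizes are just illustrative:

import java.io.Reader;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.shingle.ShingleMatrixFilter;
import org.apache.lucene.analysis.standard.StandardTokenizer;
import org.apache.lucene.util.Version;

// Word-gram analyzer: StandardTokenizer feeding a ShingleMatrixFilter.
public class WordGramAnalyzer extends Analyzer {
    @Override
    public TokenStream tokenStream(String fieldName, Reader reader) {
        TokenStream stream = new StandardTokenizer(Version.LUCENE_30, reader);
        // min shingle size, max shingle size, spacer character between words
        return new ShingleMatrixFilter(stream, 2, 3, '_');
    }
}

And a quick driver (inside a method that can throw IOException) to print
the grams it emits:

import java.io.StringReader;
import org.apache.lucene.analysis.tokenattributes.TermAttribute;

TokenStream ts = new WordGramAnalyzer()
    .tokenStream("body", new StringReader("the quick brown fox"));
TermAttribute term = ts.addAttribute(TermAttribute.class);
while (ts.incrementToken()) {
    System.out.println(term.term()); // e.g. the_quick, the_quick_brown, ...
}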
This worked pretty well for us. Now we would like to tokenize
Hangul/Korean text into word grams as well.
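My first thought is to keep the same shingle pipeline and just swap the
tokenizer, e.g. the contrib CJKTokenizer. I'm assuming here that it
handles Hangul the same way it handles Chinese/Japanese, which I haven't
verified:

import java.io.Reader;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.cjk.CJKTokenizer;
import org.apache.lucene.analysis.shingle.ShingleMatrixFilter;

// Sketch: CJKTokenizer emits character bigrams for CJK text,
// which then get shingled into word grams as before.
public class HangulGramAnalyzer extends Analyzer {
    @Override
    public TokenStream tokenStream(String fieldName, Reader reader) {
        TokenStream stream = new CJKTokenizer(reader);
        return new ShingleMatrixFilter(stream, 2, 3, '_');
    }
}

But I don't know whether character bigrams are the right unit for Korean,
or whether a morphological tokenizer would be a better base, which is why
I'm asking.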
I'm curious whether others have done something similar and would be
willing to share their experience. Any pointers on getting started would
be great.