This archived message has the CJKTokenizer code attached (there are some links in the code to material that describes the tokenization strategy).
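
For anyone reading this without the attachment at hand: as far as I know, the strategy CJKTokenizer follows is to emit overlapping two-character tokens (bigrams) for each run of CJK characters. Below is a minimal standalone sketch of that idea. It is not the attached code: the class BigramSketch and its methods are hypothetical names made up for illustration, the script check is my own assumption about what counts as CJK, and it simply skips Latin text and digits, which the real tokenizer handles separately.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; not the CJKTokenizer source.
class BigramSketch {

    /** Assumption: treat Han, Hiragana, Katakana and Hangul as "CJK". */
    static boolean isCjk(int codePoint) {
        Character.UnicodeScript s = Character.UnicodeScript.of(codePoint);
        return s == Character.UnicodeScript.HAN
            || s == Character.UnicodeScript.HIRAGANA
            || s == Character.UnicodeScript.KATAKANA
            || s == Character.UnicodeScript.HANGUL;
    }

    /** Returns overlapping bigrams for every run of CJK characters in text. */
    static List<String> bigrams(String text) {
        List<String> tokens = new ArrayList<>();
        List<Integer> run = new ArrayList<>();
        int i = 0;
        // Iterate one step past the end so the final run is flushed too.
        while (i <= text.length()) {
            int cp = i < text.length() ? text.codePointAt(i) : -1;
            if (cp != -1 && isCjk(cp)) {
                run.add(cp);
            } else {
                flush(run, tokens);
            }
            i += (cp == -1) ? 1 : Character.charCount(cp);
        }
        return tokens;
    }

    /** Turns one run of CJK code points into tokens and clears the run. */
    static void flush(List<Integer> run, List<String> tokens) {
        if (run.size() == 1) {
            // A lone CJK character becomes a single-character token.
            tokens.add(new String(Character.toChars(run.get(0))));
        }
        for (int k = 0; k + 1 < run.size(); k++) {
            tokens.add(new StringBuilder()
                .appendCodePoint(run.get(k))
                .appendCodePoint(run.get(k + 1))
                .toString());
        }
        run.clear();
    }
}
```

With this sketch, BigramSketch.bigrams("日本語") yields the two tokens 日本 and 本語, so a query for either pair of characters can match the indexed text without a dictionary-based word segmenter.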


You have to write your own analyzer that uses this tokenizer. See http://www.onjava.com/pub/a/onjava/2003/01/15/lucene.html for some details on how to write an analyzer.

Here is one you could use:
package my.pkg; // placeholder name; "package" is a reserved word and cannot be used in a package name

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.StopFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.cjk.CJKTokenizer;
import java.io.Reader;

public class CJKAnalyzer extends Analyzer {

    public CJKAnalyzer() {
    }

    /**
     * Creates a TokenStream which tokenizes all the text in the provided Reader.
     * @return A TokenStream built from a CJKTokenizer
     */
    public TokenStream tokenStream( String fieldName, Reader reader ) {
        TokenStream result = new CJKTokenizer( reader );
        // CJKTokenizer emits a "" sometimes; I haven't been able to figure out why,
        // so filtering it out as a stop word is a workaround.
        result = new StopFilter(result, new String[] {""});
        return result;
    }
}

Lastly, you have to package the tokenizer and the analyzer up and use them along with the core Lucene code.

CC'ing this to Lucene User so everyone can benefit from these answers. Maybe a FAQ entry on indexing CJK languages would be a good thing to add. The existing one (http://lucene.sourceforge.net/cgi-bin/faq/faqmanager.cgi?file=chapter.indexing&toc=faq#q28) is somewhat light on details (so is this answer, but it is a bit more direct about dealing with CJK). http://www.jguru.com/faq/view.jsp?EID=1011118 is useful to be aware of too.

Good luck,

-----Original Message-----
From: Avnish Midha
Sent: Wednesday, July 16, 2003 1:06 PM
To: Eric Isakson
Subject: CJK support in lucene

Hi Eric,

I read the description of the bug (#18933) that you reported on the Apache site, and I have a question about it. In the description you mention that CJK support should be included in the core build. Is there any other way to enable CJK support in the Lucene search engine? I would be grateful if you could let me know of any such method.

Eagerly waiting for your reply.

Thanks & Regards,
Avnish Midha
Phone no.: +1-949-8852540


Group: java-user
Posted: Jul 16, '03 at 5:38p by Eric Isakson