Hi Cedric

You could try the CJKAnalyzer in the Lucene sandbox. It is not a
perfect solution for Chinese word segmentation, but it will solve the
problem in your case.
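
As far as I know, CJKAnalyzer does not try to find real words at all; it
emits overlapping two-character tokens (bigrams). Below is a rough sketch
of that idea in plain Python, not the Lucene code itself. The function
name cjk_bigrams is made up for illustration.

```python
def cjk_bigrams(text):
    """Split text into overlapping two-character terms (illustrative only)."""
    if len(text) < 2:
        # A single character (or empty string) cannot form a bigram.
        return [text] if text else []
    return [text[i:i + 2] for i in range(len(text) - 1)]


# The sentence ABC from the question yields one unambiguous token stream.
indexed_terms = cjk_bigrams("ABC")   # AB, BC
query_terms = cjk_bigrams("AC")      # AC
```

Because the query AC is analyzed the same way, it becomes the single term
AC, which is not among the indexed terms AB and BC, so under this scheme
the sentence ABC would not match.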
On Nov 9, 2007 10:59 AM, Cedric Ho wrote:
Hi,

We are having an issue while indexing Chinese Documents in Lucene.

Some background first:
Since CJK languages don't put spaces between words, we first have to
segment each sentence into words. For example, a sentence containing
the characters ABC may be segmented into AB, C or into A, BC.

The problem is that the segmentation is sometimes ambiguous: both
AB, C and A, BC can be valid segmentations.

In these cases we would like to put both segmentations into the index:

AB offset (0,1) position 0
C offset (2,2) position 1
A offset (0,0) position 0
BC offset (1,2) position 1

Now the problem is that when someone searches with a PhraseQuery (AC),
it will match this line ABC, because it matches A (position 0) and C
(position 1).
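
To make the false match concrete, here is a small simulation of
position-based phrase matching in plain Python. It is not Lucene code;
the names TOKENS, build_positions and phrase_matches are hypothetical,
and the token data is copied from the listing above.

```python
from collections import defaultdict

# (term, start_offset, end_offset, position), as listed in the post.
TOKENS = [
    ("AB", 0, 1, 0),
    ("C", 2, 2, 1),
    ("A", 0, 0, 0),
    ("BC", 1, 2, 1),
]


def build_positions(tokens):
    """Map each term to the set of positions at which it occurs."""
    index = defaultdict(set)
    for term, _start, _end, position in tokens:
        index[term].add(position)
    return index


def phrase_matches(index, terms):
    """True if the terms occur at consecutive positions (offsets are ignored)."""
    if not terms:
        return False
    return any(
        all(first + i in index.get(term, ()) for i, term in enumerate(terms))
        for first in index.get(terms[0], ())
    )


index = build_positions(TOKENS)
false_hit = phrase_matches(index, ["A", "C"])    # A at 0, C at 1
valid_hit = phrase_matches(index, ["AB", "C"])   # a real segmentation
```

Here false_hit is True even though A and C are not adjacent in the text,
because the check only looks at positions.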

Is there any way to search for an exact match using the offset
information instead of the position information?
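
What we have in mind is roughly the check below, sketched in plain
Python. This is only an illustration of the idea, not an existing Lucene
API; offset_phrase_matches is a hypothetical name, and the offsets use
the inclusive end convention from the listing above.

```python
# (term, start_offset, end_offset) with inclusive end offsets, as in the post.
TOKENS = [
    ("AB", 0, 1),
    ("C", 2, 2),
    ("A", 0, 0),
    ("BC", 1, 2),
]


def offset_phrase_matches(tokens, terms):
    """True if the terms chain so that each starts right after the previous ends."""
    def extend(prev_end, remaining):
        if not remaining:
            return True
        return any(
            extend(end, remaining[1:])
            for term, start, end in tokens
            if term == remaining[0]
            and (prev_end is None or start == prev_end + 1)
        )

    return bool(terms) and extend(None, list(terms))


rejects_ac = not offset_phrase_matches(TOKENS, ["A", "C"])   # C starts at 2, not 1
accepts_ab_c = offset_phrase_matches(TOKENS, ["AB", "C"])
accepts_a_bc = offset_phrase_matches(TOKENS, ["A", "BC"])
```

With this check the query (AC) is rejected, since A ends at offset 0 and
C starts at offset 2, while both valid segmentations still match.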

Best Regards,
Cedric

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org

Discussion Overview
group: java-user @
categories: lucene
posted: Nov 9, '07 at 3:00a
active: Nov 11, '07 at 2:45a
posts: 7
users: 4
website: lucene.apache.org