I submitted this patch to StandardTokenizerImpl:
https://issues.apache.org/jira/browse/LUCENE-1787
Understandably it hasn't been incorporated into Lucene (yet), but I need
it for the project I'm working on. So would you recommend keeping the
same class name and just putting it on the classpath before lucene.jar,
or creating a new Tokenizer, Impl and JFlex file in my own project's
package space?

Also, the StandardTokenizerImpl.jflex file states it should be compiled
with Java 1.4, not a later JDK. Is this just for backwards compatibility?
Because the indexes will be built afresh for this project, would I
actually get better results if I used a later JVM? The project has to
deal with indexing text which can be in any language, and I'm hoping
that using the latest JVM may solve some mapping problems with Japanese,
Hebrew and Korean that I don't really understand. Also, our build process
uses Maven (not Ant) and code is built using source 1.6, so it's going to
be a pain to configure Maven to deal with this class differently.

Thanks,
Paul
---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org
Best way to create own version of StandardTokenizer ?
-
Robert Muir at Sep 4, 2009 at 4:03 pm
On Fri, Sep 4, 2009 at 11:18 AM, Paul Taylor wrote:
> I submitted this https://issues.apache.org/jira/browse/LUCENE-1787 patch to
> StandardTokenizerImpl, understandably it hasn't been incorporated into
> Lucene (yet) but I need it for the project I'm working on. So would you
> recommend keeping the same class name, and just putting in the classpath
> before the lucene.jar, or creating a new Tokenizer, Impl and JFlex file in my
> own project's package space.

i would recommend creating one in your own package space.

> Also, the StandardTokenizerImpl.jflex file states it should be compiled with
> Java 1.4 not a later JDK, is this just for backwards compatibility? Because
> the indexes will be built afresh with this project would I actually get
> better results if I used a later JVM, the project has to deal with indexing
> text which can be in any language and I'm hoping using the latest JVM may
> solve some mapping problems with Japanese, Hebrew and Korean that I don't
> really understand.

i do not think you will really get better results, but it depends what
your issue is (can you elaborate?)

upgrading from 1.4 -> 1.6 will bump your unicode version from 3 to 4.
you can see a list of the changes here:
http://www.unicode.org/versions/Unicode4.0.0/
--
Robert Muir
rcmuir@gmail.com
-
Paul Taylor at Sep 4, 2009 at 4:55 pm
Robert Muir wrote:
> i would recommend creating one in your own package space.
>
> i do not think you will really get better results, but it depends what
> your issue is (can you elaborate?)
> upgrading from 1.4 -> 1.6 will bump your unicode version from 3 to 4.
> you can see a list of the changes here:
> http://www.unicode.org/versions/Unicode4.0.0/

Things like:
http://bugs.musicbrainz.org/ticket/1006
http://bugs.musicbrainz.org/ticket/5311
http://bugs.musicbrainz.org/ticket/4827

Paul
-
Robert Muir at Sep 4, 2009 at 5:28 pm
Paul, thanks for the examples. In my opinion, only one of these is a
tokenizer problem :)
none of these will be affected by a unicode upgrade.
in this case, it appears you want to do script conversion, and it
appears from the ticket you are familiar with the details of this one
:)
one approach you could do (requiring 2.9) would be to use the new
CharFilter mechanism.
there is even a set of mappings defined here:
https://issues.apache.org/jira/secure/attachment/12408724/japanese-h-to-k-mapping.txt
but these are static mappings and may or may not handle all the cases
you care about.
another approach is using ibm ICU library for this case, as the
builtin Katakana-Hiragana works well.
you don't need to write the rules, as its built in, but if you are
curious they are defined here:
http://unicode.org/repos/cldr/trunk/common/transforms/Hiragana-Katakana.xml?rev=1.7&content-type=text/vnd.viewcvs-markup
if CharFilter/the static mappings I described do not meet your
requirements, and you want a filter that does this via the rules
above, I can give you some code.
finally, you could write a tokenfilter in java code to do this.
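
A minimal sketch of the "write it in java code" option for script conversion, assuming a plain per-character fold is enough. It relies only on the fact that the katakana letters U+30A1..U+30F6 sit exactly 0x60 above their hiragana counterparts U+3041..U+3096. It is not the ICU transform and not a Lucene TokenFilter; the class and method names are hypothetical.

```java
// Hypothetical helper: folds katakana to hiragana by code point offset.
// Not the ICU Katakana-Hiragana transform, just the basic block arithmetic.
final class KanaFolding {
    // Katakana U+30A1..U+30F6 are 0x60 above hiragana U+3041..U+3096.
    private static final int KANA_OFFSET = 0x60;

    /** Returns the text with katakana letters replaced by hiragana; other characters are unchanged. */
    public static String katakanaToHiragana(String text) {
        StringBuilder out = new StringBuilder(text.length());
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (c >= '\u30A1' && c <= '\u30F6') {
                out.append((char) (c - KANA_OFFSET));
            } else {
                // Includes the prolonged sound mark U+30FC, which has no hiragana twin.
                out.append(c);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // The escaped literal spells "katakana" in katakana letters.
        System.out.println(katakanaToHiragana("\u30AB\u30BF\u30AB\u30CA"));
    }
}
```

Inside Lucene the same loop would run over each token's characters in a filter, applied identically at index time and query time.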
in this case, it appears you want to do fullwidth-halfwidth conversion
(hard to tell from the ticket but it claims that solves the issue)
you could use a similar CharFilter approach as I described above for this one.
alternatively, you could write java code. this kind of mapping is done
within the CJKTokenizer in Lucene's contrib, and you could steal some
code from there.
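
As a rough illustration of that kind of width mapping (a sketch of the idea, not the CJKTokenizer source): the fullwidth ASCII variants U+FF01..U+FF5E are laid out in the same order as ASCII U+0021..U+007E, so subtracting 0xFEE0 yields the ordinary character. The class and method names are hypothetical.

```java
// Hypothetical helper: maps fullwidth ASCII variants to plain ASCII.
final class WidthFolding {
    // Distance between U+FF01 (fullwidth '!') and U+0021 ('!').
    private static final int FULLWIDTH_OFFSET = 0xFEE0;

    /** Returns the text with fullwidth ASCII variants replaced by their ASCII forms. */
    public static String foldFullwidthAscii(String text) {
        StringBuilder out = new StringBuilder(text.length());
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (c >= '\uFF01' && c <= '\uFF5E') {
                out.append((char) (c - FULLWIDTH_OFFSET));
            } else {
                out.append(c);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // The escaped literal is the word "Test" written in fullwidth letters.
        System.out.println(foldFullwidthAscii("\uFF34\uFF45\uFF53\uFF54"));
    }
}
```

This only covers the fullwidth ASCII range; halfwidth katakana and other compatibility forms need separate handling.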
but a different way to look at this, is that its just one example of
Unicode normalization (compatibility decomposition)
so you could say, implement a tokenfilter that normalizes your text to
NFKC and solve this problem, as well as a bunch of other issues in a
bunch of other languages.
if you want code to do this, there are several open jira tickets in
lucene with different implementations.
this is a tokenization issue. its also not unicode standard (as really
geresh/gershayim etc should be used).
in the unicode standard (uax #29 segmentation), this issue is
specifically mentioned:
For Hebrew, a tailoring may include a double quotation mark between
letters, because legacy data may contain that in place of U+05F4 (״)
gershayim. This can be done by adding double quotation mark to
MidLetter. U+05F3 (׳) HEBREW PUNCTUATION GERESH may also be included
in a tailoring.
So the easiest way for you to get this, would be to modify jflex rules
for these characters to behave differently, perhaps only when
surrounded by hebrew context.
thanks for your feedback, it inspired me to work some more on
LUCENE-1488 as it's designed to handle all these cases out of the box :)
--
Robert Muir
rcmuir@gmail.com
-
Paul Taylor at Sep 4, 2009 at 7:41 pm
Robert Muir wrote:
> Paul, thanks for the examples. In my opinion, only one of these is a
> tokenizer problem :)
> none of these will be affected by a unicode upgrade.

Thanks for taking the time to write that response. It will take me a bit
of time to understand all this because I've only ever used Lucene on
quite a simple basis, but there are some excellent ideas there and I will
take a look at your ICUAnalyser.

Paul
-
Robert Muir at Sep 4, 2009 at 7:46 pm
Paul, no problem.
it is not fully functional right now (incomplete, bugs, etc). patch is
kinda for reading only :)
but if you have other similar issues on your project, feel free to
post links to them on that jira ticket.
this way we can look at what problems you have and if appropriate
maybe they can be incorporated in (maybe not there, but somewhere).

On Fri, Sep 4, 2009 at 3:41 PM, Paul Taylor wrote:
> Thanks for taking the time to write that response, it will take me a bit of
> time to understand all this because I've only ever used Lucene on quite a
> simple basis, but some excellent ideas there and I will take a look at your
> ICUAnalyser.
>
> Paul
--
Robert Muir
rcmuir@gmail.com
-
Paul Taylor at Sep 7, 2009 at 10:08 am
Robert Muir wrote:
> Paul, thanks for the examples. In my opinion, only one of these is a
> tokenizer problem :)
> none of these will be affected by a unicode upgrade.
> another approach is using ibm ICU library for this case, as the
> builtin Katakana-Hiragana works well.
> you don't need to write the rules, as its built in, but if you are
> curious they are defined here:
> http://unicode.org/repos/cldr/trunk/common/transforms/Hiragana-Katakana.xml?rev=1.7&content-type=text/vnd.viewcvs-markup
> if CharFilter/the static mappings I described do not meet your
> requirements, and you want a filter that does this via the rules
> above, I can give you some code.

I think we would like to implement the complete Unicode rules, so if you
could provide us with some code that would be great.

> in this case, it appears you want to do fullwidth-halfwidth conversion
> (hard to tell from the ticket but it claims that solves the issue)
> you could use a similar CharFilter approach as I described above for this one.

If there is a mapping between halfwidth and fullwidth, that would work, so
everything is converted to fullwidth for indexing and searching. But
having read the details, it would seem that to convert a halfwidth
character you would have to know you were looking at Chinese (or
Korean/Japanese etc.), but as the Musicbrainz system supports any language
and the user doesn't specify the language being used when searching, I
cannot safely convert these characters because they may just be Latin
etc. However, when the entity is added to the database the language is
specified, so I could do a conversion like this to ensure all Chinese
albums were always indexed as fullwidth, and then educate users to use
fullwidth characters.

> alternatively, you could write java code. this kind of mapping is done
> within the CJKTokenizer in Lucene's contrib, and you could steal some
> code from there.

That's not really going to work for me because I need to handle all
scripts; if I add extra Chinese handling to the tokenizer I expect I'll
break handling for other languages.

> but a different way to look at this, is that its just one example of
> Unicode normalization (compatibility decomposition)
> so you could say, implement a tokenfilter that normalizes your text to
> NFKC and solve this problem, as well as a bunch of other issues in a
> bunch of other languages.
> if you want code to do this, there are several open jira tickets in
> lucene with different implementations.

I assume once again you have to know the script being used in order to
do this.

> this is a tokenization issue. its also not unicode standard (as really
> geresh/gershayim etc should be used).
> in the unicode standard (uax #29 segmentation), this issue is
> specifically mentioned:
> For Hebrew, a tailoring may include a double quotation mark between
> letters, because legacy data may contain that in place of U+05F4 (״)
> gershayim. This can be done by adding double quotation mark to
> MidLetter. U+05F3 (׳) HEBREW PUNCTUATION GERESH may also be included
> in a tailoring.
> So the easiest way for you to get this, would be to modify jflex rules
> for these characters to behave differently, perhaps only when
> surrounded by hebrew context.

I think there are two issues. Firstly, the data needs to be indexed to
always use gershayim. Is this what you are suggesting? I couldn't follow
how to change the JFlex rules.
Then it's an issue for the query parser that the user uses a " for
searching but doesn't escape it, but I cannot automatically escape it
because it may not be Hebrew.

Paul
-
Robert Muir at Sep 7, 2009 at 2:19 pm
> I think we would like to implement the complete Unicode rules, so if you
> could provide us with some code that would be great.

ok, I will followup... what version of lucene are you using, 2.9?

> ...but having read the details, it would seem that to convert a halfwidth
> character you would have to know you were looking at Chinese (or
> Korean/Japanese etc.), but as the Musicbrainz system supports any language
> and the user doesn't specify the language being used when searching

no, there's no language involved... why would you not simply apply the
filter all the time.
if i am looking at Ｔ (fullwidth character T), it should be indexed as T
everytime (or later probably t if you are going to apply
lowercasefilter)

> I assume once again you have to know the script being used in order to do
> this

this is ok, because normalization, if you want to do it that way, is
definitely not language dependent!
it's not like collation, where you have a locale 'parameter', it's a
language-independent process.
http://unicode.org/reports/tr15/

> I think there are two issues. Firstly, the data needs to be indexed to
> always use gershayim. Is this what you are suggesting? I couldn't follow
> how to change the JFlex rules.

you are right, for you there are a couple issues.
first, i do not know what standardtokenizer does with
geresh/gershayim, forget about single quote/double quote.
but to fix the tokenization (gershayim example), you want to ensure
you do not split on these.
since this is used in hebrew acronym, i would modify the acronym rule to allow
[hebrew letter]+ (" | ״) [hebrew letter]+
next, if you want these to be indexed the same so that ארה"ב and ארה״ב
will match, you will need to create a tokenfilter
to standardize " to ״ for acronyms.

> Then it's an issue for the query parser that the user uses a " for searching
> but doesn't escape it, but I cannot automatically escape it because it may
> not be Hebrew.

yes, you have a queryparser parsing ambiguity because " is also the
phrase operator.
I don't know what to recommend here off the top of my head... do you
allow phrase queries?
also as an fyi, when i say according to unicode they should be using
gershayim instead of double-quote, this is a little theoretical.
its not very user-friendly to expect users to use gershayim for input,
when its not even on hebrew keyboard layout...!
http://en.wikipedia.org/wiki/Hebrew_keyboard#Inaccessible_punctuation
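
The acronym rule and the quote standardization described above can be sketched with plain java.util.regex, outside of JFlex and Lucene. This is only an illustration under assumptions: "hebrew letter" is taken to mean alef..tav (U+05D0..U+05EA), and the class and method names are hypothetical. In a real setup the first pattern would become a JFlex rule and the second would live in a TokenFilter applied to acronym tokens.

```java
import java.util.regex.Pattern;

// Hypothetical helper, not part of Lucene: recognizes Hebrew acronyms written
// with either a double quote or gershayim, and standardizes them to gershayim.
final class HebrewAcronyms {
    // Assumption: "hebrew letter" means alef..tav, U+05D0..U+05EA.
    private static final String HEBREW = "[\u05D0-\u05EA]";

    // [hebrew letter]+ (double quote | gershayim) [hebrew letter]+
    private static final Pattern ACRONYM =
            Pattern.compile(HEBREW + "+[\"\u05F4]" + HEBREW + "+");

    // A double quote that has a Hebrew letter on both sides.
    private static final Pattern QUOTE_BETWEEN_LETTERS =
            Pattern.compile("(?<=" + HEBREW + ")\"(?=" + HEBREW + ")");

    /** True if the whole token has the acronym shape. */
    public static boolean isHebrewAcronym(String token) {
        return ACRONYM.matcher(token).matches();
    }

    /** Replaces a double quote between Hebrew letters with gershayim (U+05F4). */
    public static String standardizeGershayim(String token) {
        return QUOTE_BETWEEN_LETTERS.matcher(token).replaceAll("\u05F4");
    }

    public static void main(String[] args) {
        // alef, resh, he, double quote, bet: the acronym used as the example above.
        String legacy = "\u05D0\u05E8\u05D4\"\u05D1";
        System.out.println(isHebrewAcronym(legacy));
        System.out.println(standardizeGershayim(legacy));
    }
}
```

Because the replacement only fires between two Hebrew letters, quotes around Latin text are left alone. It does not resolve the query parser ambiguity, since the parser sees the raw " before any tokenization happens.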
--
Robert Muir
rcmuir@gmail.com
-
Paul Taylor at Sep 7, 2009 at 2:47 pm
Robert Muir wrote:
> ok, I will followup... what version of lucene are you using, 2.9?

Yes

> no, there's no language involved... why would you not simply apply the
> filter all the time.
> if i am looking at Ｔ (fullwidth character T), it should be indexed as T
> everytime (or later probably t if you are going to apply
> lowercasefilter)

I'm obviously misunderstanding. I thought that halfwidth was an encoding
to allow storing the most common Chinese characters in a single byte, and
therefore the characters would be read as different characters if you
assumed they were using the halfwidth encoding rather than a Latin
encoding. But are you saying halfwidth characters are actually valid
Unicode characters with their own distinct Unicode values, so I can just
use a CharFilter again to map these? Please confirm.

> this is ok, because normalization, if you want to do it that way, is
> definitely not language dependent!
> it's not like collation, where you have a locale 'parameter', it's a
> language-independent process.
> http://unicode.org/reports/tr15/
>
> you are right, for you there are a couple issues.
> first, i do not know what standardtokenizer does with
> geresh/gershayim, forget about single quote/double quote.
> but to fix the tokenization (gershayim example), you want to ensure
> you do not split on these.
> since this is used in hebrew acronym, i would modify the acronym rule to allow
> [hebrew letter]+ (" | ״) [hebrew letter]+
> next, if you want these to be indexed the same so that ארה"ב and ארה״ב
> will match, you will need to create a tokenfilter
> to standardize " to ״ for acronyms.

Oh I see, so we convert one to the other, but only when the token matches
ACRONYM_TYPE.

> yes, you have a queryparser parsing ambiguity because " is also the
> phrase operator.
> I don't know what to recommend here off the top of my head... do you
> allow phrase queries?

Yes we do, we allow full Lucene syntax if the 'Advanced Query' option
is selected at http://musicbrainz.org/

> also as an fyi, when i say according to unicode they should be using
> gershayim instead of double-quote, this is a little theoretical.
> its not very user-friendly to expect users to use gershayim for input,
> when its not even on hebrew keyboard layout...!
> http://en.wikipedia.org/wiki/Hebrew_keyboard#Inaccessible_punctuation

Understood, so I think users will continue to use the double quote
character in their searches.
-
Robert Muir at Sep 7, 2009 at 3:04 pm
On Mon, Sep 7, 2009 at 10:47 AM, Paul Taylor wrote:
>> ok, I will followup... what version of lucene are you using, 2.9?
> Yes

I will update LUCENE-1488 with the latest code so you can steal the
ICUTransformFilter from there.

> I'm obviously misunderstanding I thought that Halfwidth was an encoding to
> allow storing the most common Chinese characters in a single byte, therefore
> the characters would be read as different characters if you assumed they were
> using the HalfWidth Encoding rather than Latin Encoding. But are you saying
> Halfwidth characters are actually valid Unicode characters with their own
> distinct unicode value so can just use a CharFilter again to map these,
> please confirm.

yes, fullwidth latin forms are distinct characters that have a different width:
http://unicode.org/cldr/utility/list-unicodeset.jsp?a=[:East_Asian_Width=Fullwidth:]
so yes, you can use charfilter to map these to their standard latin forms.
beware though, there is a similar issue with halfwidth characters:
http://unicode.org/cldr/utility/list-unicodeset.jsp?a=[:East_Asian_Width=Halfwidth:]
example, ﾅ (U+FF85) is halfwidth for the standard ナ (U+30CA)
so you might want to include mappings for those as well.
the reason i brought up normalization is because this issue (width) is
a subset of things normalization can help with.
if you click on some of the characters in the two sets i provided you
will notice properties like 'toNFKC' containing the 'standardized'
form.
if in the future, you run into trouble with things in other languages
that aren't matching as expected,
because they aren't being considered the "same" when perhaps they
should, then a more general approach would be applying Unicode
normalization form NFKC in a TokenFilter.
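
A minimal sketch of that NFKC step, using the JDK's own java.text.Normalizer (available from Java 6) instead of ICU. It shows only the normalization call that a TokenFilter would apply to each term; the Lucene wiring is left out, and the class and method names are hypothetical.

```java
import java.text.Normalizer;

// Hypothetical helper: applies Unicode normalization form NFKC to a term.
final class NfkcFolding {
    /** Returns the NFKC form of the text; no language or locale is involved. */
    public static String nfkc(String text) {
        return Normalizer.normalize(text, Normalizer.Form.NFKC);
    }

    public static void main(String[] args) {
        // U+FF34 is fullwidth T, U+FF85 is halfwidth katakana NA.
        System.out.println(nfkc("\uFF34"));
        System.out.println(nfkc("\uFF85"));
    }
}
```

Both width cases from this thread fall out of the one call: fullwidth Ｔ becomes T and halfwidth ﾅ becomes ナ. The JDK's Unicode data may be older than ICU's, so results for recently added characters can differ between the two.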
--
Robert Muir
rcmuir@gmail.com
Discussion Overview
| group | java-user |
| categories | lucene |
| posted | Sep 4, '09 at 3:19p |
| active | Sep 7, '09 at 3:04p |
| posts | 10 |
| users | 2 |
| website | lucene.apache.org |
