FAQ

Best way to create my own version of StandardTokenizer?

Paul Taylor
Sep 4, 2009 at 3:19 pm
I submitted this patch to StandardTokenizerImpl:
https://issues.apache.org/jira/browse/LUCENE-1787. Understandably it
hasn't been incorporated into Lucene (yet), but I need it for the
project I'm working on. So would you recommend keeping the same class
name and just putting my version in the classpath before the
lucene.jar, or creating a new Tokenizer, Impl and JFlex file in my own
project's package space?

Also, the StandardTokenizerImpl.jflex file states it should be compiled
with Java 1.4 rather than a later JDK; is this just for backwards
compatibility? Because the indexes will be built afresh for this
project, would I actually get better results if I used a later JVM? The
project has to deal with indexing text which can be in any language,
and I'm hoping that using the latest JVM may solve some mapping
problems with Japanese, Hebrew and Korean that I don't really
understand. Also, our build process uses Maven (not Ant) and code is
built with source level 1.6, so it's going to be a pain to configure
Maven to deal with this class differently.

thanks Paul


9 responses

  • Robert Muir at Sep 4, 2009 at 4:03 pm

    On Fri, Sep 4, 2009 at 11:18 AM, Paul Taylor wrote:
    I submitted this patch to StandardTokenizerImpl:
    https://issues.apache.org/jira/browse/LUCENE-1787. Understandably it
    hasn't been incorporated into Lucene (yet), but I need it for the
    project I'm working on. So would you recommend keeping the same class
    name and just putting my version in the classpath before the
    lucene.jar, or creating a new Tokenizer, Impl and JFlex file in my own
    project's package space?
    I would recommend creating one in your own package space.
    Also, the StandardTokenizerImpl.jflex file states it should be compiled
    with Java 1.4 rather than a later JDK; is this just for backwards
    compatibility? Because the indexes will be built afresh for this
    project, would I actually get better results if I used a later JVM? The
    project has to deal with indexing text which can be in any language,
    and I'm hoping that using the latest JVM may solve some mapping
    problems with Japanese, Hebrew and Korean that I don't really
    understand.
    I do not think you will really get better results, but it depends on
    what your issue is (can you elaborate?).
    Upgrading from 1.4 to 1.6 will bump your Unicode version from 3 to 4;
    you can see a list of the changes here:
    http://www.unicode.org/versions/Unicode4.0.0/


    --
    Robert Muir
    rcmuir@gmail.com

  • Paul Taylor at Sep 4, 2009 at 4:55 pm

    Robert Muir wrote:
    On Fri, Sep 4, 2009 at 11:18 AM, Paul Taylor wrote:
    I submitted this patch to StandardTokenizerImpl:
    https://issues.apache.org/jira/browse/LUCENE-1787. Understandably it
    hasn't been incorporated into Lucene (yet), but I need it for the
    project I'm working on. So would you recommend keeping the same class
    name and just putting my version in the classpath before the
    lucene.jar, or creating a new Tokenizer, Impl and JFlex file in my own
    project's package space?
    I would recommend creating one in your own package space.

    Also, the StandardTokenizerImpl.jflex file states it should be compiled
    with Java 1.4 rather than a later JDK; is this just for backwards
    compatibility? Because the indexes will be built afresh for this
    project, would I actually get better results if I used a later JVM? The
    project has to deal with indexing text which can be in any language,
    and I'm hoping that using the latest JVM may solve some mapping
    problems with Japanese, Hebrew and Korean that I don't really
    understand.
    I do not think you will really get better results, but it depends on
    what your issue is (can you elaborate?).
    Upgrading from 1.4 to 1.6 will bump your Unicode version from 3 to 4;
    you can see a list of the changes here:
    http://www.unicode.org/versions/Unicode4.0.0/

    Things like:

    http://bugs.musicbrainz.org/ticket/1006
    http://bugs.musicbrainz.org/ticket/5311
    http://bugs.musicbrainz.org/ticket/4827

    Paul


  • Robert Muir at Sep 4, 2009 at 5:28 pm
    Paul, thanks for the examples. In my opinion, only one of these is a
    tokenizer problem :)
    None of these will be affected by a Unicode upgrade.

    For one of them, it appears you want to do script conversion, and it
    appears from the ticket that you are familiar with the details of this
    one :)

    One approach you could take (requiring 2.9) would be to use the new
    CharFilter mechanism. There is even a set of mappings defined here:
    https://issues.apache.org/jira/secure/attachment/12408724/japanese-h-to-k-mapping.txt
    but these are static mappings and may or may not handle all the cases
    you care about.
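
    A minimal sketch of that approach, assuming Lucene 2.9's
    MappingCharFilter/NormalizeCharMap with a couple of illustrative
    hiragana-to-katakana mappings (the class name and mappings here are
    just placeholders, not anything shipped with Lucene):

        import java.io.Reader;
        import org.apache.lucene.analysis.CharReader;
        import org.apache.lucene.analysis.MappingCharFilter;
        import org.apache.lucene.analysis.NormalizeCharMap;
        import org.apache.lucene.analysis.TokenStream;
        import org.apache.lucene.analysis.standard.StandardTokenizer;
        import org.apache.lucene.util.Version;

        public class HiraganaToKatakanaExample {
            public static TokenStream tokenize(Reader input) {
                // A couple of illustrative mappings; the full mapping file
                // linked above would be loaded into the map the same way.
                NormalizeCharMap map = new NormalizeCharMap();
                map.add("\u3042", "\u30A2");   // HIRAGANA A -> KATAKANA A
                map.add("\u3044", "\u30A4");   // HIRAGANA I -> KATAKANA I

                // The CharFilter rewrites characters before the tokenizer sees them.
                return new StandardTokenizer(Version.LUCENE_29,
                        new MappingCharFilter(map, CharReader.get(input)));
            }
        }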

    Another approach is to use the IBM ICU library for this case, as its
    builtin Katakana-Hiragana transform works well.
    You don't need to write the rules, as it's built in, but if you are
    curious they are defined here:
    http://unicode.org/repos/cldr/trunk/common/transforms/Hiragana-Katakana.xml?rev=1.7&content-type=text/vnd.viewcvs-markup
    If CharFilter/the static mappings I described do not meet your
    requirements, and you want a filter that does this via the rules above,
    I can give you some code.
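
    Roughly, such a filter could look like the hand-rolled sketch below,
    written against Lucene 2.9's attribute API and ICU4J's Transliterator
    (this is not the actual LUCENE-1488 code, and the class name is made
    up):

        import java.io.IOException;
        import com.ibm.icu.text.Transliterator;
        import org.apache.lucene.analysis.TokenFilter;
        import org.apache.lucene.analysis.TokenStream;
        import org.apache.lucene.analysis.tokenattributes.TermAttribute;

        /** Sketch: folds katakana to hiragana with ICU4J so both scripts
         *  index to the same form. */
        public final class KatakanaToHiraganaFilter extends TokenFilter {
            private final Transliterator translit =
                    Transliterator.getInstance("Katakana-Hiragana");
            private final TermAttribute termAtt = addAttribute(TermAttribute.class);

            public KatakanaToHiraganaFilter(TokenStream input) {
                super(input);
            }

            @Override
            public boolean incrementToken() throws IOException {
                if (!input.incrementToken()) {
                    return false;
                }
                // Transliterate the whole term and write it back.
                termAtt.setTermBuffer(translit.transliterate(termAtt.term()));
                return true;
            }
        }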

    Finally, you could write a TokenFilter in Java code to do this.

    In the other case, it appears you want to do fullwidth-halfwidth
    conversion (hard to tell from the ticket, but it claims that solves the
    issue).

    You could use a similar CharFilter approach as I described above for
    this one.

    Alternatively, you could write Java code. This kind of mapping is done
    within the CJKTokenizer in Lucene's contrib, and you could steal some
    code from there.

    But a different way to look at this is that it's just one example of
    Unicode normalization (compatibility decomposition). So you could, say,
    implement a TokenFilter that normalizes your text to NFKC and solve
    this problem, as well as a bunch of other issues in a bunch of other
    languages.
    If you want code to do this, there are several open Jira tickets in
    Lucene with different implementations.
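
    As one possible shape for that (a sketch using Java 6's built-in
    java.text.Normalizer rather than any of the Jira patches; the class
    name is made up):

        import java.io.IOException;
        import java.text.Normalizer;
        import org.apache.lucene.analysis.TokenFilter;
        import org.apache.lucene.analysis.TokenStream;
        import org.apache.lucene.analysis.tokenattributes.TermAttribute;

        /** Sketch: normalizes every token to Unicode NFKC, which folds
         *  fullwidth/halfwidth forms and many other compatibility variants. */
        public final class NFKCNormalizationFilter extends TokenFilter {
            private final TermAttribute termAtt = addAttribute(TermAttribute.class);

            public NFKCNormalizationFilter(TokenStream input) {
                super(input);
            }

            @Override
            public boolean incrementToken() throws IOException {
                if (!input.incrementToken()) {
                    return false;
                }
                String normalized =
                        Normalizer.normalize(termAtt.term(), Normalizer.Form.NFKC);
                termAtt.setTermBuffer(normalized);
                return true;
            }
        }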
    The Hebrew one is a tokenization issue. It is also not standard Unicode
    usage (as really geresh/gershayim etc. should be used).
    In the Unicode standard (UAX #29, text segmentation), this issue is
    specifically mentioned:

    For Hebrew, a tailoring may include a double quotation mark between
    letters, because legacy data may contain that in place of U+05F4 (״)
    gershayim. This can be done by adding double quotation mark to
    MidLetter. U+05F3 (׳) HEBREW PUNCTUATION GERESH may also be included
    in a tailoring.

    So the easiest way for you to get this would be to modify the JFlex
    rules for these characters to behave differently, perhaps only when
    surrounded by Hebrew context.


    Thanks for your feedback; it inspired me to work some more on
    LUCENE-1488, as it's designed to handle all these cases out of the box :)


    --
    Robert Muir
    rcmuir@gmail.com

  • Paul Taylor at Sep 4, 2009 at 7:41 pm

    Robert Muir wrote:
    Paul, thanks for the examples. In my opinion, only one of these is a
    tokenizer problem :)
    None of these will be affected by a Unicode upgrade.
    Thanks for taking the time to write that response; it will take me a
    bit of time to understand all this because I've only ever used Lucene
    on quite a simple basis, but there are some excellent ideas there and I
    will take a look at your ICUAnalyser.

    Paul

  • Robert Muir at Sep 4, 2009 at 7:46 pm
    Paul, no problem.

    It is not fully functional right now (incomplete, bugs, etc.); the
    patch is kinda for reading only :)
    But if you have other similar issues on your project, feel free to post
    links to them on that Jira ticket.
    This way we can look at what problems you have and, if appropriate,
    maybe they can be incorporated (maybe not there, but somewhere).


    --
    Robert Muir
    rcmuir@gmail.com

  • Paul Taylor at Sep 7, 2009 at 10:08 am

    Robert Muir wrote:
    Paul, thanks for the examples. In my opinion, only one of these is a
    tokenizer problem :)
    None of these will be affected by a Unicode upgrade.


    Another approach is to use the IBM ICU library for this case, as its
    builtin Katakana-Hiragana transform works well.
    You don't need to write the rules, as it's built in, but if you are
    curious they are defined here:
    http://unicode.org/repos/cldr/trunk/common/transforms/Hiragana-Katakana.xml?rev=1.7&content-type=text/vnd.viewcvs-markup
    If CharFilter/the static mappings I described do not meet your
    requirements, and you want a filter that does this via the rules above,
    I can give you some code.
    I think we would like to implement the complete Unicode rules, so if
    you could provide us with some code, that would be great.
    In the other case, it appears you want to do fullwidth-halfwidth
    conversion (hard to tell from the ticket, but it claims that solves the
    issue).

    You could use a similar CharFilter approach as I described above for
    this one.
    If there is a mapping from halfwidth to fullwidth, that would work, so
    everything is converted to fullwidth for indexing and searching. But
    having read the details, it would seem that to convert a halfwidth
    character you would have to know you were looking at Chinese (or
    Korean/Japanese etc.), and as the MusicBrainz system supports any
    language and the user doesn't specify the language being used when
    searching, I cannot safely convert these characters because they may
    just be Latin etc. However, when the entity is added to the database
    the language is specified, so I could do a conversion like this to
    ensure all Chinese albums were always indexed as fullwidth, and then
    educate users to use fullwidth characters.
    Alternatively, you could write Java code. This kind of mapping is done
    within the CJKTokenizer in Lucene's contrib, and you could steal some
    code from there.
    That's not really going to work for me because I need to handle all
    scripts; if I add extra Chinese handling to the tokenizer, I expect
    I'll break the handling for other languages.
    But a different way to look at this is that it's just one example of
    Unicode normalization (compatibility decomposition). So you could, say,
    implement a TokenFilter that normalizes your text to NFKC and solve
    this problem, as well as a bunch of other issues in a bunch of other
    languages.
    If you want code to do this, there are several open Jira tickets in
    Lucene with different implementations.
    I assume once again you have to know the script being used in order to
    do this?
    The Hebrew one is a tokenization issue. It is also not standard Unicode
    usage (as really geresh/gershayim etc. should be used).
    In the Unicode standard (UAX #29, text segmentation), this issue is
    specifically mentioned:

    For Hebrew, a tailoring may include a double quotation mark between
    letters, because legacy data may contain that in place of U+05F4 (״)
    gershayim. This can be done by adding double quotation mark to
    MidLetter. U+05F3 (׳) HEBREW PUNCTUATION GERESH may also be included
    in a tailoring.

    So the easiest way for you to get this would be to modify the JFlex
    rules for these characters to behave differently, perhaps only when
    surrounded by Hebrew context.
    I think there are two issues. Firstly, the data needs to be indexed to
    always use gershayim; is this what you are suggesting? I couldn't
    follow how to change the JFlex rules.
    Then it's an issue for the query parser that the user uses a " for
    searching but doesn't escape it, and I cannot automatically escape it
    because it may not be Hebrew.


    Paul

  • Robert Muir at Sep 7, 2009 at 2:19 pm

    I think we would like to implement the complete Unicode rules, so if
    you could provide us with some code, that would be great.
    OK, I will follow up... what version of Lucene are you using, 2.9?

    ...
    but having read the details it would seem that to convert a halfwidth
    character you would have to know you were looking at Chinese (or
    Korean/Japanese etc.), but as the MusicBrainz system supports any
    language and the user doesn't specify the language being used when
    searching
    No, there's no language involved... why would you not simply apply the
    filter all the time?
    If I am looking at Ｔ (the fullwidth character T), it should be indexed
    as T every time (or later probably t, if you are going to apply
    LowerCaseFilter).
    I assume once again you have to know the script being used in order to
    do this
    This is OK, because normalization, if you want to do it that way, is
    definitely not language dependent!
    It's not like collation, where you have a locale 'parameter'; it's a
    language-independent process.
    http://unicode.org/reports/tr15/
    I think there are two issues. Firstly, the data needs to be indexed to
    always use gershayim; is this what you are suggesting? I couldn't
    follow how to change the JFlex rules.
    You are right, for you there are a couple of issues.
    First, I do not know what StandardTokenizer does with geresh/gershayim,
    forget about single quote/double quote.

    But to fix the tokenization (the gershayim example), you want to ensure
    you do not split on these.
    Since this is used in Hebrew acronyms, I would modify the acronym rule
    to allow

    [hebrew letter]+ (" | ״) [hebrew letter]+

    Next, if you want these to be indexed the same, so that ארה"ב and ארה״ב
    will match, you will need to create a TokenFilter to standardize " to ״
    for acronyms.
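
    A rough sketch of such a filter, assuming Lucene 2.9's attribute API
    and that your tokenizer tags these tokens with StandardTokenizer's
    "<ACRONYM>" type (the class name is made up, and the type string
    depends on your grammar):

        import java.io.IOException;
        import org.apache.lucene.analysis.TokenFilter;
        import org.apache.lucene.analysis.TokenStream;
        import org.apache.lucene.analysis.tokenattributes.TermAttribute;
        import org.apache.lucene.analysis.tokenattributes.TypeAttribute;

        /** Sketch: replaces ASCII double quote with U+05F4 (gershayim)
         *  inside tokens the tokenizer tagged as acronyms. */
        public final class GershayimFilter extends TokenFilter {
            private final TermAttribute termAtt = addAttribute(TermAttribute.class);
            private final TypeAttribute typeAtt = addAttribute(TypeAttribute.class);

            public GershayimFilter(TokenStream input) {
                super(input);
            }

            @Override
            public boolean incrementToken() throws IOException {
                if (!input.incrementToken()) {
                    return false;
                }
                if ("<ACRONYM>".equals(typeAtt.type())) {
                    termAtt.setTermBuffer(termAtt.term().replace('"', '\u05F4'));
                }
                return true;
            }
        }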
    Then it's an issue for the query parser that the user uses a " for
    searching but doesn't escape it, and I cannot automatically escape it
    because it may not be Hebrew.
    Yes, you have a query parser parsing ambiguity because " is also the
    phrase operator.
    I don't know what to recommend here off the top of my head... do you
    allow phrase queries?

    Also, as an FYI, when I say that according to Unicode they should be
    using gershayim instead of a double quote, this is a little
    theoretical.
    It's not very user-friendly to expect users to use gershayim for input
    when it's not even on the Hebrew keyboard layout...!

    http://en.wikipedia.org/wiki/Hebrew_keyboard#Inaccessible_punctuation

    --
    Robert Muir
    rcmuir@gmail.com

  • Paul Taylor at Sep 7, 2009 at 2:47 pm

    Robert Muir wrote:
    I think we would like to implement the complete Unicode rules, so if
    you could provide us with some code, that would be great.
    OK, I will follow up... what version of Lucene are you using, 2.9?

    ...
    Yes.
    but having read the details it would seem that to convert a halfwidth
    character you would have to know you were looking at Chinese (or
    Korean/Japanese etc.), but as the MusicBrainz system supports any
    language and the user doesn't specify the language being used when
    searching
    No, there's no language involved... why would you not simply apply the
    filter all the time?
    If I am looking at Ｔ (the fullwidth character T), it should be indexed
    as T every time (or later probably t, if you are going to apply
    LowerCaseFilter).
    I'm obviously misunderstanding; I thought that halfwidth was an
    encoding to allow storing the most common Chinese characters in a
    single byte, and therefore the characters would be read as different
    characters if you assumed they were using the halfwidth encoding rather
    than a Latin encoding. But are you saying halfwidth characters are
    actually valid Unicode characters, with their own distinct Unicode
    values, so I can just use a CharFilter again to map these? Please
    confirm.
    I assume once again you have to know the script being used in order to
    do this
    This is OK, because normalization, if you want to do it that way, is
    definitely not language dependent!
    It's not like collation, where you have a locale 'parameter'; it's a
    language-independent process.
    http://unicode.org/reports/tr15/

    I think there are two issues. Firstly, the data needs to be indexed to
    always use gershayim; is this what you are suggesting? I couldn't
    follow how to change the JFlex rules.
    You are right, for you there are a couple of issues.
    First, I do not know what StandardTokenizer does with geresh/gershayim,
    forget about single quote/double quote.

    But to fix the tokenization (the gershayim example), you want to ensure
    you do not split on these.
    Since this is used in Hebrew acronyms, I would modify the acronym rule
    to allow

    [hebrew letter]+ (" | ״) [hebrew letter]+

    Next, if you want these to be indexed the same, so that ארה"ב and ארה״ב
    will match, you will need to create a TokenFilter to standardize " to ״
    for acronyms.
    Oh I see, so we convert one to the other, but only when the token
    matches ACRONYM_TYPE.
    Then it's an issue for the query parser that the user uses a " for
    searching but doesn't escape it, and I cannot automatically escape it
    because it may not be Hebrew.
    Yes, you have a query parser parsing ambiguity because " is also the
    phrase operator.
    I don't know what to recommend here off the top of my head... do you
    allow phrase queries?
    Yes we do; we allow full Lucene syntax if the 'Advanced Query' option
    is selected at http://musicbrainz.org/
    Also, as an FYI, when I say that according to Unicode they should be
    using gershayim instead of a double quote, this is a little
    theoretical.
    It's not very user-friendly to expect users to use gershayim for input
    when it's not even on the Hebrew keyboard layout...!

    http://en.wikipedia.org/wiki/Hebrew_keyboard#Inaccessible_punctuation
    Understood, so I think users will continue to use the double quote
    character in their searches.

  • Robert Muir at Sep 7, 2009 at 3:04 pm

    On Mon, Sep 7, 2009 at 10:47 AM, Paul Taylor wrote:
    Robert Muir wrote:
    I think we would like to implement the complete Unicode rules, so if
    you could provide us with some code, that would be great.
    OK, I will follow up... what version of Lucene are you using, 2.9?

    ...
    Yes.
    I will update LUCENE-1488 with the latest code so you can steal the
    ICUTransformFilter from there.
    I'm obviously misunderstanding; I thought that halfwidth was an
    encoding to allow storing the most common Chinese characters in a
    single byte, and therefore the characters would be read as different
    characters if you assumed they were using the halfwidth encoding rather
    than a Latin encoding. But are you saying halfwidth characters are
    actually valid Unicode characters, with their own distinct Unicode
    values, so I can just use a CharFilter again to map these? Please
    confirm.
    Yes, fullwidth Latin forms are distinct characters that have a
    different width:
    http://unicode.org/cldr/utility/list-unicodeset.jsp?a=[:East_Asian_Width=Fullwidth:]

    So yes, you can use a CharFilter to map these to their standard Latin
    forms.

    Beware though, there is a similar issue with halfwidth characters:
    http://unicode.org/cldr/utility/list-unicodeset.jsp?a=[:East_Asian_Width=Halfwidth:]
    For example, ﾅ (U+FF85) is the halfwidth form of the standard ナ
    (U+30CA), so you might want to include mappings for those as well.

    The reason I brought up normalization is that this issue (width) is a
    subset of the things normalization can help with.
    If you click on some of the characters in the two sets I provided, you
    will notice properties like 'toNFKC' containing the 'standardized'
    form.

    If, in the future, you run into trouble with things in other languages
    that aren't matching as expected, because they aren't being considered
    the "same" when perhaps they should be, then a more general approach
    would be applying Unicode normalization form NFKC in a TokenFilter.
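
    To make the width example concrete, here is a tiny demo (plain Java 6,
    nothing Lucene-specific; the class name is made up) showing that NFKC
    folds both cases:

        import java.text.Normalizer;

        public class WidthFoldingDemo {
            public static void main(String[] args) {
                // Fullwidth Latin 'Ｔ' (U+FF34) folds to plain 'T'.
                System.out.println(Normalizer.normalize("\uFF34", Normalizer.Form.NFKC));
                // Halfwidth katakana 'ﾅ' (U+FF85) folds to standard 'ナ' (U+30CA).
                System.out.println(Normalizer.normalize("\uFF85", Normalizer.Form.NFKC));
            }
        }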

    --
    Robert Muir
    rcmuir@gmail.com

