
[jira] Created: (LUCENE-1702) Thai token type() bug

Robert Muir (JIRA)
Jun 19, 2009 at 3:34 pm
Thai token type() bug
---------------------

Key: LUCENE-1702
URL: https://issues.apache.org/jira/browse/LUCENE-1702
Project: Lucene - Java
Issue Type: Bug
Components: contrib/analyzers
Reporter: Robert Muir
Priority: Minor


While adding tests for offsets and type to ThaiAnalyzer, I discovered that it does not type Thai numeric digits correctly.
ThaiAnalyzer uses StandardTokenizer, and this is really an issue with the grammar, which adds the entire [:Thai:] block to ALPHANUM.

I propose that ALPHANUM be described a little differently in the grammar.
Instead of adding a whole block, [:letter:] should be allowed to have diacritics/signs/combining marks attached to it.

This would allow the [:Thai:] hack to be removed completely, would allow StandardTokenizer to parse complex writing systems such as Indian languages, and would fix LUCENE-1545.
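
As a rough illustration of the proposal (not the StandardTokenizer grammar itself), the sketch below types a token using only java.lang.Character: letters may carry combining marks, and a token made solely of decimal digits is typed as a number. The class name, the typeOf helper and the "<OTHER>" label are hypothetical; "<ALPHANUM>" and "<NUM>" are used here simply as labels.

```java
final class ThaiTypeSketch {
    // True for combining marks (general categories Mn, Mc, Me) that attach to a letter.
    static boolean isMark(int cp) {
        int t = Character.getType(cp);
        return t == Character.NON_SPACING_MARK
            || t == Character.COMBINING_SPACING_MARK
            || t == Character.ENCLOSING_MARK;
    }

    // Hypothetical classifier: digits only -> <NUM>; at least one letter, plus
    // any digits or marks -> <ALPHANUM>; anything else -> <OTHER>.
    static String typeOf(String token) {
        boolean sawLetter = false;
        boolean sawDigit = false;
        int i = 0;
        while (i < token.length()) {
            int cp = token.codePointAt(i);
            if (Character.isDigit(cp)) {
                sawDigit = true;
            } else if (Character.isLetter(cp)) {
                sawLetter = true;
            } else if (!isMark(cp)) {
                return "<OTHER>";
            }
            i += Character.charCount(cp);
        }
        if (sawLetter) return "<ALPHANUM>";
        return sawDigit ? "<NUM>" : "<OTHER>";
    }

    public static void main(String[] args) {
        // Three Thai digits (U+0E51, U+0E52, U+0E53).
        System.out.println(typeOf("\u0E51\u0E52\u0E53")); // <NUM>
        // Thai letter, vowel sign (a nonspacing mark), Thai letter.
        System.out.println(typeOf("\u0E01\u0E34\u0E19")); // <ALPHANUM>
    }
}
```

With a block-based rule, both tokens would fall into the same class because every code point is in the Thai block; keying on general category keeps the digits separate.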


--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org
7 responses

  • Steven Rowe (JIRA) at Jun 19, 2009 at 3:54 pm
    [ https://issues.apache.org/jira/browse/LUCENE-1702?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12721833#action_12721833 ]

    Steven Rowe commented on LUCENE-1702:
    -------------------------------------

    +1 (I was involved in perpetuating the Thai grammar hack)

    FWIW, JFlex 1.5, which hopefully will be released in the next few months, will have better Unicode support, including general category, script, and block property support, as well as the ability to select the Unicode version. This will simplify the grammar. (Note that JFlex 1.5-generated scanners will require Java 1.5, so we won't be using it in Lucene until after Lucene 3.0 has been released.)
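
To show the difference between a block property and a general category property, here is a small sketch using java.util.regex as a stand-in (this is not JFlex syntax, and the class and helper names are made up for illustration). The Thai block contains both letters and digits, while the general category tells them apart.

```java
import java.util.regex.Pattern;

final class UnicodePropertyDemo {
    // Block property: every code point in the Thai block (U+0E00 to U+0E7F).
    static final Pattern THAI_BLOCK = Pattern.compile("\\p{InThai}");
    // General category properties: decimal digits (Nd) and letters (L).
    static final Pattern DECIMAL_DIGIT = Pattern.compile("\\p{Nd}");
    static final Pattern LETTER = Pattern.compile("\\p{L}");

    // Hypothetical helper: does the pattern match this single code point?
    static boolean matches(Pattern p, int codePoint) {
        return p.matcher(new String(Character.toChars(codePoint))).matches();
    }

    public static void main(String[] args) {
        int thaiDigitOne = 0x0E51;
        int thaiKoKai = 0x0E01;
        System.out.println(matches(THAI_BLOCK, thaiDigitOne));    // true
        System.out.println(matches(THAI_BLOCK, thaiKoKai));       // true
        System.out.println(matches(DECIMAL_DIGIT, thaiDigitOne)); // true
        System.out.println(matches(LETTER, thaiDigitOne));        // false
    }
}
```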

  • Robert Muir (JIRA) at Jun 19, 2009 at 4:00 pm
    [ https://issues.apache.org/jira/browse/LUCENE-1702?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12721840#action_12721840 ]

    Robert Muir commented on LUCENE-1702:
    -------------------------------------

    Steven I have been watching that jflex 1.5 branch with great anticipation :)

    Do you think it will support characters outside of the BMP?

    (My hope is that it might perform better than the ICU RBBI for some other things I am working on)
  • Steven Rowe (JIRA) at Jun 19, 2009 at 4:32 pm
    [ https://issues.apache.org/jira/browse/LUCENE-1702?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12721859#action_12721859 ]

    Steven Rowe commented on LUCENE-1702:
    -------------------------------------

    bq. Steven I have been watching that jflex 1.5 branch with great anticipation :)

    Cool! If you mention this on the jflex-devel mailing list, you may be able to help nudge Gerwin Klein (JFlex founder and main developer) into starting work on merging the 1.5 branch onto the trunk :)

    bq. Do you think it will support characters outside of the BMP?

    As you may already know, the 1.5 branch does not yet include above-BMP support. However, this is definitely a future goal.

    My guess is that 1.5.0 will be BMP-only, and that 1.5.X or 1.6 will add above-BMP support. (This is my guess because the Unicode properties code is present and functional in the branch now, but no work has yet been done to add above-BMP support.)
  • Robert Muir (JIRA) at Jun 19, 2009 at 4:38 pm
    [ https://issues.apache.org/jira/browse/LUCENE-1702?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12721867#action_12721867 ]

    Robert Muir commented on LUCENE-1702:
    -------------------------------------

    Steven, even without >BMP support, the 1.5 branch would make the grammar file clearer and more maintainable.
    Otherwise, codepoint ranges must be used.

    I'll take your advice and send the nudge.

    I think for this issue it would be best to wait for the 1.5.0 version of jflex for clarity.
    I think even without >BMP support, we should still be able to function.
    For example, surrogate pairs with a lead surrogate in D840-D87F point to the SIP (Supplementary Ideographic Plane), so they should be typed as CJK.

    For reference (I haven't looked at jflex), above-BMP support might require new data structures. I think ICU uses things like tries / compact arrays to deal with the fact that you have thousands of codepoints with the same property value, etc.
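
The D840-D87F range follows directly from the UTF-16 encoding of plane 2 (U+20000 to U+2FFFF). A minimal sketch that derives it; the class and method names are hypothetical:

```java
final class SipSurrogates {
    static final int SIP_START = 0x20000; // first code point of plane 2
    static final int SIP_END = 0x2FFFF;   // last code point of plane 2

    // Lead (high) surrogate of a supplementary code point, per the UTF-16 definition.
    static char leadSurrogate(int codePoint) {
        return (char) (0xD800 + ((codePoint - 0x10000) >> 10));
    }

    // A BMP-only scanner sees surrogates one char at a time; the lead alone
    // is enough to tell that the pair encodes a code point in the SIP.
    static boolean leadPointsToSip(char lead) {
        return lead >= leadSurrogate(SIP_START) && lead <= leadSurrogate(SIP_END);
    }

    public static void main(String[] args) {
        System.out.printf("%04X-%04X%n",
            (int) leadSurrogate(SIP_START), (int) leadSurrogate(SIP_END)); // D840-D87F
        char[] units = Character.toChars(0x20000);
        System.out.println(leadPointsToSip(units[0])); // true
    }
}
```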

  • Steven Rowe (JIRA) at Jun 19, 2009 at 5:00 pm
    [ https://issues.apache.org/jira/browse/LUCENE-1702?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12721877#action_12721877 ]

    Steven Rowe commented on LUCENE-1702:
    -------------------------------------

    bq. I think for this issue it would be best to wait for the 1.5.0 version of jflex for clarity.

    +0, in that the arrival time for 1.5.0 is unknown, but I'll defer to your judgment.

    bq. for reference (haven't looked at jflex), above-bmp support might require new data structures. I think ICU uses things like tries / compactarrays to deal with the fact you have thousands of codepoints with the same property value, etc.

    Thanks for the heads-up. The above-BMP property values for the currently supported properties are now encoded on the 1.5 branch as range pairs (they just aren't accessible yet because of the BMP limit). Since JFlex is a regular expression engine, code for handling large character sets (as sets of ranges) is already built-in, so I don't anticipate this will be a problem. The main thing will just be to switch from char to int for character representation.
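
To make the "sets of ranges" and "char to int" points concrete, here is a small sketch of a range-pair set keyed on int code points. It is not JFlex's implementation; the class, its layout and the example ranges (Thai digits and the CJK Unified Ideographs Extension B block) are chosen only for illustration.

```java
final class CodePointRangeSet {
    // Sorted, non-overlapping inclusive pairs: start0, end0, start1, end1, ...
    private final int[] bounds;

    CodePointRangeSet(int... bounds) {
        if (bounds.length % 2 != 0) {
            throw new IllegalArgumentException("bounds must be start/end pairs");
        }
        this.bounds = bounds.clone();
    }

    // Binary search over the pairs; works the same for BMP and above-BMP values.
    boolean contains(int codePoint) {
        int lo = 0;
        int hi = bounds.length / 2 - 1;
        while (lo <= hi) {
            int mid = (lo + hi) >>> 1;
            if (codePoint < bounds[2 * mid]) {
                hi = mid - 1;
            } else if (codePoint > bounds[2 * mid + 1]) {
                lo = mid + 1;
            } else {
                return true;
            }
        }
        return false;
    }

    // Walks the input by code point (int), not by char, so a surrogate pair
    // is looked up as the single supplementary code point it encodes.
    int countMatches(String s) {
        int count = 0;
        int i = 0;
        while (i < s.length()) {
            int cp = s.codePointAt(i);
            if (contains(cp)) count++;
            i += Character.charCount(cp);
        }
        return count;
    }

    public static void main(String[] args) {
        CodePointRangeSet set = new CodePointRangeSet(0x0E50, 0x0E59, 0x20000, 0x2A6DF);
        // Thai digit one, U+20000 as a surrogate pair, then an ASCII letter.
        System.out.println(set.countMatches("\u0E51\uD840\uDC00a")); // 2
    }
}
```

A char-based loop over the same string would look up D840 and DC00 separately and miss the second match, which is why the element type has to change even though the range representation does not.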
  • Robert Muir (JIRA) at Jun 19, 2009 at 5:10 pm
    [ https://issues.apache.org/jira/browse/LUCENE-1702?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12721884#action_12721884 ]

    Robert Muir commented on LUCENE-1702:
    -------------------------------------

    Steven, thanks for the information; the range representation sounds interesting.

    So I'll let others comment on whether they want this fixed pre-1.5.0. In that case we could define macros in jflex that represent what we want, with comments indicating how they will be defined in the future jflex.
    Either way, a specific Unicode version should be selected, with the macros defined from that Unicode version, or with that Unicode version specified to jflex 1.5.0... Unicode 5.1 sounds good to me :)

    The matchVersion could be used to ensure that back compat always works.

  • Robert Muir (JIRA) at Aug 3, 2009 at 11:15 am
    [ https://issues.apache.org/jira/browse/LUCENE-1702?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12738280#action_12738280 ]

    Robert Muir commented on LUCENE-1702:
    -------------------------------------

    Related to this issue, Steven has added support for Unicode text segmentation properties to the 1.5 dev branch of jflex: http://sourceforge.net/mailarchive/message.php?msg_name=4A747D60.4090904%40odyssey.net

    We should be able to start prototyping a different definition of ALPHANUM, etc. that solves this issue (and improves tokenization of many languages!)
