Support all of unicode in StandardTokenizer
-------------------------------------------

Key: LUCENE-2847
URL: https://issues.apache.org/jira/browse/LUCENE-2847
Project: Lucene - Java
Issue Type: Bug
Components: Analysis
Reporter: Robert Muir
Fix For: 3.1, 4.0
Attachments: LUCENE-2847.patch

StandardTokenizer currently only supports the BMP.

If it encounters characters outside of the BMP, it just discards them...
it should instead fully implement UAX#29 across all of Unicode.
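
As background for the description above, here is a minimal sketch in plain Java of what "outside of the BMP" means at the `String` level. It is not taken from the patch and does not use any Lucene API; the class name `SupplementaryDemo` and its helper methods are made up for illustration. A supplementary code point (above U+FFFF) is stored as a surrogate pair, so code that works one `char` at a time sees two code units rather than one character.

```java
// Illustrative only: a hypothetical helper class, not part of Lucene or the patch.
final class SupplementaryDemo {
    private SupplementaryDemo() {}

    /** Number of Unicode code points in s; a surrogate pair counts once. */
    static int countCodePoints(String s) {
        return s.codePointCount(0, s.length());
    }

    /** True if s contains at least one code point above U+FFFF. */
    static boolean hasSupplementary(String s) {
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);
            if (Character.isSupplementaryCodePoint(cp)) {
                return true;
            }
            // Advance by 1 char for BMP code points, 2 for supplementary ones.
            i += Character.charCount(cp);
        }
        return false;
    }

    public static void main(String[] args) {
        // U+10400 (DESERET CAPITAL LETTER LONG I) written as a surrogate pair.
        String s = "a\uD801\uDC00b";
        System.out.println(s.length());          // 4 UTF-16 code units
        System.out.println(countCodePoints(s));  // 3 code points
        System.out.println(hasSupplementary(s)); // true
    }
}
```

A scanner whose character classes stop at U+FFFF has no rule that matches such a pair, which is consistent with the behaviour reported here of supplementary characters being dropped.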

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org


  • Robert Muir (JIRA) at Jan 5, 2011 at 6:43 pm
    [ https://issues.apache.org/jira/browse/LUCENE-2847?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

    Robert Muir updated LUCENE-2847:
    --------------------------------

    Attachment: LUCENE-2847.patch

    Here's a patch... I added a simple test.

    I'm sure it can be beautified etc.
  • Steven Rowe (JIRA) at Jan 5, 2011 at 11:13 pm
    [ https://issues.apache.org/jira/browse/LUCENE-2847?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12978027#action_12978027 ]

    Steven Rowe commented on LUCENE-2847:
    -------------------------------------

    JFlex generates fine, everything compiles, all tests pass.

    If we add a target in {{modules/analysis/icu/build.xml}} to run {{GenerateJFlexSupplementaryMacros#main()}}, maybe named {{gen-stdtok-supp-macros}}, the {{jflex}} target in {{modules/analysis/common/build.xml}} could use a {{<subant>}} to call it and auto-generate {{SUPPLEMENTARY.jflex-macro}}, no?

  • Robert Muir (JIRA) at Jan 5, 2011 at 11:29 pm
    [ https://issues.apache.org/jira/browse/LUCENE-2847?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12978037#action_12978037 ]

    Robert Muir commented on LUCENE-2847:
    -------------------------------------

    {quote}
    If we add a target in modules/analysis/icu/build.xml to run GenerateJFlexSupplementaryMacros#main(), maybe named gen-stdtok-supp-macros, the jflex target in modules/analysis/common/build.xml could use a <subant> to call it and auto-generate SUPPLEMENTARY.jflex-macro, no?
    {quote}

    Yeah, I think we could do something like this. We could also consolidate tools, because in general I would rather all the analyzers
    be consolidated; they are only split up due to dependencies/large files etc. But tools are different, it's just to assist the build.
  • Steven Rowe (JIRA) at Jan 6, 2011 at 5:08 am
    [ https://issues.apache.org/jira/browse/LUCENE-2847?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

    Steven Rowe updated LUCENE-2847:
    --------------------------------

    Attachment: LUCENE-2847.patch

    New patch, with the following changes:

    # Added a new target {{gen-uax29-supp-macros}} to {{modules/analysis/icu/build.xml}}, and a {{<subant>}} call to it from the {{jflex}} task in {{modules/analysis/common/build.xml}}.
    # Included {{SUPPLEMENTARY.jflex-macro}} in {{UAX29URLEmailTokenizer.jflex}} in the same way as it is included in {{StandardTokenizer.jflex}}
    # Copied the simple supplementary characters test from {{TestStandardAnalyzer.java}} to {{TestUAX29URLEmailTokenizer.java}}.
    # Modified the CHANGES.txt entry for the UAX#29 issues to include a reference to this issue.

    All tests pass.
  • Steven Rowe (JIRA) at Jan 6, 2011 at 5:16 am
    [ https://issues.apache.org/jira/browse/LUCENE-2847?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

    Steven Rowe updated LUCENE-2847:
    --------------------------------

    Attachment: LUCENE-2847.patch

    Removed the WARNING from the {{UAX29URLEmailTokenizer}} class javadocs about Unicode supplementary character non-coverage.
  • Steven Rowe (JIRA) at Jan 6, 2011 at 5:18 am
    [ https://issues.apache.org/jira/browse/LUCENE-2847?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12978130#action_12978130 ]

    Steven Rowe edited comment on LUCENE-2847 at 1/6/11 12:16 AM:
    --------------------------------------------------------------

    New patch, with the following changes:

    # Added a new target {{gen-uax29-supp-macros}} to {{modules/analysis/icu/build.xml}}, and a {{<subant>}} call to it from the {{jflex}} task in {{modules/analysis/common/build.xml}}.
    # Included {{SUPPLEMENTARY.jflex-macro}} in {{UAX29URLEmailTokenizer.jflex}} in the same way as it is included in {{StandardTokenizer.jflex}}
    # Copied the simple supplementary characters test from {{TestStandardAnalyzer.java}} to {{TestUAX29URLEmailTokenizer.java}}.
    # Modified the CHANGES.txt entry for the UAX#29 issues to include a reference to this issue.

    All tests pass.

    was (Author: steve_rowe):
    New patch, with the following changes:

    # Added a new target {{gen-uax29-supp-macros}} to {{modules/analysis/icu/build.xml}}, and a {{<subant>}} call to it from the {{jflex}} task in {{modules/analysis/common/build.xml}}.
    # Included SUPPLEMENTARY.jflex-macro}} in {{UAX29URLEmailTokenizer.jflex}} in the same way as it is included in {{StandardTokenizer.jflex}}
    # Copied the simple supplementary characters test from {{TestStandardAnalyzer.java}} to {{TestUAX29URLEmailTokenizer.java}}.
    # Modified the CHANGES.txt entry for the UAX#29 issues to include a reference to this issue.

    All tests pass.
  • Steven Rowe (JIRA) at Jan 6, 2011 at 5:22 am
    [ https://issues.apache.org/jira/browse/LUCENE-2847?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12978141#action_12978141 ]

    Steven Rowe commented on LUCENE-2847:
    -------------------------------------

    bq. We could also consolidate tools, because in general I would rather all the analyzers be consolidated; they are only split up due to dependencies/large files etc. But tools are different, it's just to assist the build.

    How far would you go with this tools consolidation? All tools across the whole of Scenolunr? Or just the ones under {{modules/analysis/}}?
  • Robert Muir (JIRA) at Jan 6, 2011 at 1:03 pm
    [ https://issues.apache.org/jira/browse/LUCENE-2847?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12978305#action_12978305 ]

    Robert Muir commented on LUCENE-2847:
    -------------------------------------

    bq. How far would you go with this tools consolidation? All tools across the whole of Scenolunr? Or just the ones under modules/analysis/?

    I just meant under the analyzers module... but let's leave this be; I also forgot we have no analyzers module in 3.x.

    I think we should commit your latest patch.

  • Steven Rowe (JIRA) at Jan 6, 2011 at 1:21 pm
    [ https://issues.apache.org/jira/browse/LUCENE-2847?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

    Steven Rowe reassigned LUCENE-2847:
    -----------------------------------

    Assignee: Steven Rowe
  • Steven Rowe (JIRA) at Jan 6, 2011 at 1:23 pm
    [ https://issues.apache.org/jira/browse/LUCENE-2847?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12978310#action_12978310 ]

    Steven Rowe commented on LUCENE-2847:
    -------------------------------------

    bq. I think we should commit your latest patch.

    OK, I'll commit shortly.
  • Steven Rowe (JIRA) at Jan 6, 2011 at 2:57 pm
    [ https://issues.apache.org/jira/browse/LUCENE-2847?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

    Steven Rowe resolved LUCENE-2847.
    ---------------------------------

    Resolution: Fixed

    Committed to trunk: r1055877, branch_3x: r1055904.
  • Robert Muir (JIRA) at Jan 6, 2011 at 3:41 pm
    [ https://issues.apache.org/jira/browse/LUCENE-2847?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12978369#action_12978369 ]

    Robert Muir commented on LUCENE-2847:
    -------------------------------------

    Thanks for taking care of this!

    I think the added files need svn:eol-style=native ?
    Also, I think we should add an ASL2 license to the generated macros?
    I noticed the TLD generator does this, but I forgot to do it here.
  • Steven Rowe (JIRA) at Jan 6, 2011 at 7:41 pm
    [ https://issues.apache.org/jira/browse/LUCENE-2847?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12978469#action_12978469 ]

    Steven Rowe commented on LUCENE-2847:
    -------------------------------------

    {quote}
    I think the added files need svn:eol-style=native ?
    Also, I think we should add an ASL2 license to the generated macros?
    I noticed the TLD generator does this, but I forgot to do it here.
    {quote}

    Done: trunk: r1056014, branch r1056030

  • Steven Rowe (JIRA) at Jan 6, 2011 at 7:43 pm
    [ https://issues.apache.org/jira/browse/LUCENE-2847?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12978469#action_12978469 ]

    Steven Rowe edited comment on LUCENE-2847 at 1/6/11 2:41 PM:
    -------------------------------------------------------------

    {quote}
    I think the added files need svn:eol-style=native ?
    Also, I think we should add an ASL2 license to the generated macros?
    I noticed the TLD generator does this, but I forgot to do it here.
    {quote}

    Done: trunk: r1056014, branch_3x: r1056030


    was (Author: steve_rowe):
    {quote}
    I think the added files need svn:eol-style=native ?
    Also, I think we should add an ASL2 license to the generated macros?
    I noticed the TLD generator does this, but I forgot to do it here.
    {quote}

    Done: trunk: r1056014, branch r1056030

