Update StandardTokenizer and UAX29Tokenizer to Unicode 6.0.0
------------------------------------------------------------

Key: LUCENE-2699
URL: https://issues.apache.org/jira/browse/LUCENE-2699
Project: Lucene - Java
Issue Type: Improvement
Components: contrib/analyzers
Affects Versions: 3.1, 4.0
Reporter: Steven Rowe
Assignee: Steven Rowe
Priority: Minor
Fix For: 3.1, 4.0


Newly released Unicode 6.0.0 contains some character property changes from the previous release (5.2.0) that affect word segmentation (UAX#29), and JFlex 1.5.0-SNAPSHOT now supports Unicode 6.0.0, so Lucene's UAX#29-based tokenizers should be updated accordingly.

Note that the UAX#29 word break rules themselves did not change between Unicode versions 5.2.0 and 6.0.0.
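
For readers unfamiliar with word segmentation, here is a minimal sketch of boundary-based word splitting. It uses the JDK's `java.text.BreakIterator` as a stand-in, which is an assumption for illustration only: it is not Lucene's JFlex-generated tokenizer, and its rules and Unicode version depend on the JDK in use. The class name `WordBreakSketch` and the method `words` are hypothetical.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical illustration; not part of Lucene or of the LUCENE-2699 patch.
public class WordBreakSketch {

    /** Returns the segments between word boundaries that contain a letter or digit. */
    public static List<String> words(String text) {
        // The JDK's word-boundary iterator, used here only as a stand-in for a UAX#29 tokenizer.
        BreakIterator it = BreakIterator.getWordInstance(Locale.ROOT);
        it.setText(text);
        List<String> out = new ArrayList<>();
        int start = it.first();
        for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
            String segment = text.substring(start, end);
            // Skip segments made up only of whitespace or punctuation.
            if (segment.codePoints().anyMatch(Character::isLetterOrDigit)) {
                out.add(segment);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(words("Hello world"));
    }
}
```

Where the boundaries fall depends on the character properties of each code point, which is why a change in those properties between Unicode versions can change tokenizer output even when the break rules stay the same.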

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org


  • Steven Rowe (JIRA) at Oct 13, 2010 at 4:33 am
    [ https://issues.apache.org/jira/browse/LUCENE-2699?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

    Steven Rowe updated LUCENE-2699:
    --------------------------------

    Attachment: LUCENE-2699.patch

    Patch upgrading UAX#29-based tokenizers to Unicode 6.0.0.
  • Robert Muir (JIRA) at Oct 13, 2010 at 8:43 am
    [ https://issues.apache.org/jira/browse/LUCENE-2699?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12920475#action_12920475 ]

    Robert Muir commented on LUCENE-2699:
    -------------------------------------

    +1

    Does the minimum JFlex revision need to be bumped (currently r591 in READ_BEFORE_REGENERATING.txt)?
  • Steven Rowe (JIRA) at Oct 13, 2010 at 1:53 pm
    [ https://issues.apache.org/jira/browse/LUCENE-2699?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12920556#action_12920556 ]

    Steven Rowe commented on LUCENE-2699:
    -------------------------------------

    Yes, it should be bumped to r597. I'll post a patch including that shortly. Thanks for the review, Robert.
  • Steven Rowe (JIRA) at Oct 13, 2010 at 2:11 pm
    [ https://issues.apache.org/jira/browse/LUCENE-2699?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

    Steven Rowe updated LUCENE-2699:
    --------------------------------

    Attachment: LUCENE-2699.patch

    In this version of the patch, I bumped the minimum JFlex trunk revision in READ_BEFORE_REGENERATING.txt to 597. Also added a CHANGES.txt entry.
  • Steven Rowe (JIRA) at Oct 13, 2010 at 5:10 pm
    [ https://issues.apache.org/jira/browse/LUCENE-2699?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

    Steven Rowe updated LUCENE-2699:
    --------------------------------

    Attachment: LUCENE-2699.patch

    This version of the patch fixes mixed end-of-lines present in the previous versions - I installed http://www.apache.org/dev/svn-eol-style.txt - thanks Robert!
  • Steven Rowe (JIRA) at Oct 15, 2010 at 4:01 am
    [ https://issues.apache.org/jira/browse/LUCENE-2699?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

    Steven Rowe updated LUCENE-2699:
    --------------------------------

    Attachment: LUCENE-2699.patch

    The previous patches were generated on WinVista under Cygwin using a native Windows client (SlikSVN) -- apparently that isn't a good idea - mixed end-of-line styles run rampant (and rampant, whatever it is, can't be good).

    This version of the patch *really* doesn't have mixed end-of-line styles. I used the Cygwin svn client this time, and all end-of-lines are LF.

    I plan on committing shortly.
  • Steven Rowe (JIRA) at Oct 15, 2010 at 6:36 am
    [ https://issues.apache.org/jira/browse/LUCENE-2699?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

    Steven Rowe resolved LUCENE-2699.
    ---------------------------------

    Resolution: Fixed
  • Steven Rowe (JIRA) at Oct 15, 2010 at 6:36 am
    [ https://issues.apache.org/jira/browse/LUCENE-2699?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12921248#action_12921248 ]

    Steven Rowe commented on LUCENE-2699:
    -------------------------------------

    Committed: trunk revision 1022826, branch_3x revision 1022831
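
Several of the messages above deal with patches that contained mixed end-of-line styles. A minimal sketch of how such a file could be detected follows. It is not the tooling used in this thread (the fix there was the svn-eol-style configuration and the Cygwin svn client). The class name `EolStyleCheck`, the nested `Counts` type, and the `count` method are hypothetical names introduced for illustration.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical illustration; not part of Lucene or of the LUCENE-2699 patch.
public class EolStyleCheck {

    /** Number of line endings of each style found in the input. */
    public static final class Counts {
        public final int lf;
        public final int crlf;
        public final int cr;

        Counts(int lf, int crlf, int cr) {
            this.lf = lf;
            this.crlf = crlf;
            this.cr = cr;
        }

        /** True when more than one end-of-line style occurs. */
        public boolean isMixed() {
            int styles = (lf > 0 ? 1 : 0) + (crlf > 0 ? 1 : 0) + (cr > 0 ? 1 : 0);
            return styles > 1;
        }
    }

    /** Scans raw bytes and counts LF, CRLF, and bare CR line endings. */
    public static Counts count(byte[] data) {
        int lf = 0, crlf = 0, cr = 0;
        for (int i = 0; i < data.length; i++) {
            if (data[i] == '\r') {
                if (i + 1 < data.length && data[i + 1] == '\n') {
                    crlf++;
                    i++; // the LF belongs to this CRLF pair
                } else {
                    cr++;
                }
            } else if (data[i] == '\n') {
                lf++;
            }
        }
        return new Counts(lf, crlf, cr);
    }

    public static void main(String[] args) {
        byte[] sample = "a\r\nb\nc\n".getBytes(StandardCharsets.US_ASCII);
        Counts c = count(sample);
        System.out.println("LF=" + c.lf + " CRLF=" + c.crlf + " CR=" + c.cr + " mixed=" + c.isMixed());
    }
}
```

To check a real patch file, the bytes could be read with `java.nio.file.Files.readAllBytes` and passed to `count`; a file whose line endings are all LF, like the final patch described above, would report `isMixed()` as false.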
