Improve speed of ThaiWordFilter by CharacterIterator, factor out LowerCasing and also fix some bugs (empty tokens stop iteration)
---------------------------------------------------------------------------------------------------------------------------------

Key: LUCENE-2404
URL: https://issues.apache.org/jira/browse/LUCENE-2404
Project: Lucene - Java
Issue Type: Bug
Components: contrib/analyzers
Reporter: Uwe Schindler
Assignee: Robert Muir
Fix For: 3.1
Attachments: LUCENE-2404.patch

The ThaiWordFilter creates new Strings out of the term buffer before passing them to the BreakIterator. But BreakIterator can take a CharacterIterator and process it directly, without copying the buffer.
As Java itself does not provide a char[]-based CharacterIterator implementation in java.text, we can use the javax.swing.text.Segment class, which operates on a char[] and is even reusable! This class is very strange, but it works, is in JDK 1.4+ and is not deprecated.
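
A minimal sketch of the approach described above, not the code from the attached patch. It reuses one Segment and one BreakIterator across calls and points the Segment at an existing char[] instead of building a String. The class name SegmentBreakDemo, the method spans and the use of Locale.ROOT are assumptions made for illustration.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import javax.swing.text.Segment;

class SegmentBreakDemo {
    // Both objects are created once and reused for every buffer.
    private final BreakIterator breaker = BreakIterator.getWordInstance(Locale.ROOT);
    private final Segment segment = new Segment();

    /** Returns the [start, end) offsets between consecutive boundaries in buffer[0..length). */
    List<int[]> spans(char[] buffer, int length) {
        // Segment exposes its fields publicly, so it can be re-pointed without allocation.
        segment.array = buffer;
        segment.offset = 0;
        segment.count = length;
        // Segment implements CharacterIterator, so no String copy is needed here.
        breaker.setText(segment);

        List<int[]> result = new ArrayList<int[]>();
        int start = breaker.first();
        for (int end = breaker.next(); end != BreakIterator.DONE; start = end, end = breaker.next()) {
            result.add(new int[] {start, end});
        }
        return result;
    }

    public static void main(String[] args) {
        char[] buffer = "hello world".toCharArray();
        for (int[] span : new SegmentBreakDemo().spans(buffer, buffer.length)) {
            System.out.println(span[0] + "-" + span[1]);
        }
    }
}
```

Because the offset is set to 0, the reported boundaries are plain indexes into the char[]. The list of spans is only there to make the demo observable; a filter would consume the boundaries one at a time.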

The filter also had a bug: it stopped iterating tokens when an empty token occurred. Also, the lowercasing of non-Thai words was removed from the filter and moved into the Analyzer by adding a LowerCaseFilter.
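
The exact cause of the empty-token bug is not shown in this thread. The sketch below only illustrates the general failure mode, under the assumption that the old filter treated "this token produced no output" as "the stream has ended". EmptyTokenDemo and drain are hypothetical names, and a plain Iterator stands in for the token stream.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

class EmptyTokenDemo {
    /** Collects tokens from input; skipEmpty selects the fixed or the buggy behaviour. */
    static List<String> drain(Iterator<String> input, boolean skipEmpty) {
        List<String> out = new ArrayList<String>();
        while (input.hasNext()) {
            String token = input.next();
            if (token.isEmpty()) {
                if (skipEmpty) {
                    continue; // fixed: an empty token is skipped and iteration goes on
                }
                break;        // assumed bug: an empty token ends the whole stream
            }
            out.add(token);
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> tokens = Arrays.asList("a", "", "b");
        System.out.println(drain(tokens.iterator(), false));
        System.out.println(drain(tokens.iterator(), true));
    }
}
```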

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

  • Uwe Schindler (JIRA) at Apr 19, 2010 at 5:40 pm
    [ https://issues.apache.org/jira/browse/LUCENE-2404?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

    Uwe Schindler updated LUCENE-2404:
    ----------------------------------

    Attachment: LUCENE-2404.patch
  • Uwe Schindler (JIRA) at Apr 19, 2010 at 6:02 pm
    [ https://issues.apache.org/jira/browse/LUCENE-2404?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

    Uwe Schindler updated LUCENE-2404:
    ----------------------------------

    Attachment: LUCENE-2404.patch

    New patch, which preserves backwards compatibility with matchVersion. It automatically adds a LowerCaseFilter in the ctor of ThaiWordFilter, so the behaviour does not change, except for a second bug:
    The previous version of the filter did not correctly lowercase a token that contains "ThaiEnglishThai" text. As the LowerCaseFilter is now plugged in before the Thai processing, such a token is lowercased correctly, so it's a backwards break.

    As of matchVersion 3.1, the LowerCaseFilter is no longer added automatically; instead, ThaiAnalyzer plugs it in before the ThaiWordFilter (I reversed the order compared to the previous patch, so that the order is the same in the deprecated and the current case).
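
    A plain-Java sketch of the "ThaiEnglishThai" case described above. It assumes, going by the discussion in this thread, that the old filter lowercased a token only when its first char was outside the Thai Unicode block. LowerCaseOrderDemo, oldStyle and lowerCaseFirst are hypothetical helpers, not Lucene code.

    ```java
    import java.util.Locale;

    class LowerCaseOrderDemo {
        static boolean startsWithThai(String term) {
            return !term.isEmpty()
                && Character.UnicodeBlock.of(term.charAt(0)) == Character.UnicodeBlock.THAI;
        }

        // Assumed old behaviour: tokens starting with a Thai char are never lowercased.
        static String oldStyle(String term) {
            return startsWithThai(term) ? term : term.toLowerCase(Locale.ROOT);
        }

        // Behaviour with a lowercasing step placed before the Thai processing.
        static String lowerCaseFirst(String term) {
            return term.toLowerCase(Locale.ROOT);
        }

        public static void main(String[] args) {
            // Three Thai chars, then "English", then the same three Thai chars.
            String mixed = "\u0E44\u0E17\u0E22English\u0E44\u0E17\u0E22";
            System.out.println(oldStyle(mixed).contains("English"));       // uppercase E survives
            System.out.println(lowerCaseFirst(mixed).contains("english")); // lowercased
        }
    }
    ```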
  • Robert Muir (JIRA) at Apr 19, 2010 at 6:18 pm
    [ https://issues.apache.org/jira/browse/LUCENE-2404?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12858618#action_12858618 ]

    Robert Muir commented on LUCENE-2404:
    -------------------------------------

    This is great. It already more than doubles the speed of this filter on English text...

    But this filter has always been cheating with the UnicodeBlock check on charAt(0), as you could have EnglishThaiEnglish too.
    It also cheats because it doesn't check that the break boundaries are words, and not just spaces or punctuation.

    I suppose the above two things are not much of a problem if you assume StandardTokenizer, but they may be a problem for
    other Tokenizers... It is tricky to figure out how to make it correct and still as fast as the 'cheating'.
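
    One possible way to check that a span between two break boundaries is a word rather than whitespace or punctuation is sketched below. This is an illustration only, not what the filter or the patch does: WordSpanDemo and its methods are hypothetical names, the text is passed as a String for brevity, and "contains at least one letter or digit" is an assumed definition of a word.

    ```java
    import java.text.BreakIterator;
    import java.util.ArrayList;
    import java.util.List;
    import java.util.Locale;

    class WordSpanDemo {
        /** Returns only those boundary-delimited spans that contain a letter or digit. */
        static List<String> words(String text) {
            BreakIterator breaker = BreakIterator.getWordInstance(Locale.ROOT);
            breaker.setText(text);
            List<String> words = new ArrayList<String>();
            int start = breaker.first();
            for (int end = breaker.next(); end != BreakIterator.DONE; start = end, end = breaker.next()) {
                if (containsLetterOrDigit(text, start, end)) {
                    words.add(text.substring(start, end));
                }
            }
            return words;
        }

        static boolean containsLetterOrDigit(CharSequence text, int start, int end) {
            for (int i = start; i < end; i++) {
                if (Character.isLetterOrDigit(text.charAt(i))) {
                    return true;
                }
            }
            return false;
        }

        public static void main(String[] args) {
            System.out.println(words("hello, world")); // the "," and " " spans are dropped
        }
    }
    ```

    The extra scan per span is the kind of cost that makes a correct check slower than the shortcut.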

  • Uwe Schindler (JIRA) at Apr 19, 2010 at 6:26 pm
    [ https://issues.apache.org/jira/browse/LUCENE-2404?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

    Uwe Schindler updated LUCENE-2404:
    ----------------------------------

    Attachment: LUCENE-2404-2.patch

    Another variant of the previous patch, slightly faster as Robert said; maybe it gives us some inspiration. It uses cloneAttributes and does not create new clones all the time.
  • Robert Muir (JIRA) at Apr 19, 2010 at 6:34 pm
    [ https://issues.apache.org/jira/browse/LUCENE-2404?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12858624#action_12858624 ]

    Robert Muir commented on LUCENE-2404:
    -------------------------------------

    Looking at Uwe's code points out there is another bug in the old filter,
    it does not really handle position increments correctly.

    So if there is a stopword or some other posInc it gets applied to all
    subwords... in my opinion only the first token should carry this.

    Otherwise you can have problems like SOLR-1852
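
    The rule proposed above (only the first subword carries the original position increment) can be sketched in plain Java as follows. PosIncDemo, SubToken and assign are hypothetical names, and the increment of 1 for the remaining subwords is an assumption about what "only the first token should carry this" implies.

    ```java
    import java.util.ArrayList;
    import java.util.Arrays;
    import java.util.List;

    class PosIncDemo {
        /** Hypothetical value type: one subword plus its position increment. */
        static final class SubToken {
            final String text;
            final int positionIncrement;

            SubToken(String text, int positionIncrement) {
                this.text = text;
                this.positionIncrement = positionIncrement;
            }
        }

        /** The first subword inherits the original increment; the rest advance by one. */
        static List<SubToken> assign(List<String> subwords, int originalIncrement) {
            List<SubToken> out = new ArrayList<SubToken>();
            for (int i = 0; i < subwords.size(); i++) {
                int increment = (i == 0) ? originalIncrement : 1;
                out.add(new SubToken(subwords.get(i), increment));
            }
            return out;
        }

        public static void main(String[] args) {
            // An original increment of 2 models one removed stopword before the token.
            for (SubToken t : assign(Arrays.asList("a", "b", "c"), 2)) {
                System.out.println(t.text + " +" + t.positionIncrement);
            }
        }
    }
    ```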
  • Uwe Schindler (JIRA) at Apr 19, 2010 at 7:30 pm
    [ https://issues.apache.org/jira/browse/LUCENE-2404?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12858645#action_12858645 ]

    Uwe Schindler commented on LUCENE-2404:
    ---------------------------------------

    Yes, this is not a problem if you use the ThaiAnalyzer, as StopFilter comes after this filter. But users of Solr plugging this filter after a StopFilter will have problems.
  • Uwe Schindler (JIRA) at Apr 19, 2010 at 8:34 pm
    [ https://issues.apache.org/jira/browse/LUCENE-2404?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

    Uwe Schindler updated LUCENE-2404:
    ----------------------------------

    Attachment: LUCENE-2404-2.patch

    New version of the cloneAttributes variant, which produces correct position increments with matchVersion=3.1.
  • Robert Muir (JIRA) at Apr 19, 2010 at 8:36 pm
    [ https://issues.apache.org/jira/browse/LUCENE-2404?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12858672#action_12858672 ]

    Robert Muir commented on LUCENE-2404:
    -------------------------------------

    Nice Uwe. So this patch fixes quite a few bugs and speeds things up... do you want to commit?
  • Uwe Schindler (JIRA) at Apr 19, 2010 at 8:44 pm
    [ https://issues.apache.org/jira/browse/LUCENE-2404?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

    Uwe Schindler reassigned LUCENE-2404:
    -------------------------------------

    Assignee: Uwe Schindler (was: Robert Muir)
  • Uwe Schindler (JIRA) at Apr 19, 2010 at 8:44 pm
    [ https://issues.apache.org/jira/browse/LUCENE-2404?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12858678#action_12858678 ]

    Uwe Schindler commented on LUCENE-2404:
    ---------------------------------------

    I will commit this soon after adding changes.
  • Uwe Schindler (JIRA) at Apr 19, 2010 at 8:59 pm
    [ https://issues.apache.org/jira/browse/LUCENE-2404?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

    Uwe Schindler resolved LUCENE-2404.
    -----------------------------------

    Resolution: Fixed

    Committed revisions: 935734 (Lucene) & 935739 (Solr)
