[ https://issues.apache.org/jira/browse/LUCENE-1799?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12932207#action_12932207 ]

Earwin Burrfoot commented on LUCENE-1799:
-----------------------------------------

.. and not the Codec, as was suggested in the beginning.
Unicode compression
-------------------

Key: LUCENE-1799
URL: https://issues.apache.org/jira/browse/LUCENE-1799
Project: Lucene - Java
Issue Type: New Feature
Components: Store
Affects Versions: 2.4.1
Reporter: DM Smith
Priority: Minor
Attachments: Benchmark.java, Benchmark.java, Benchmark.java, LUCENE-1779.patch, LUCENE-1799.patch, LUCENE-1799.patch, LUCENE-1799.patch, LUCENE-1799.patch, LUCENE-1799.patch, LUCENE-1799.patch, LUCENE-1799.patch, LUCENE-1799.patch, LUCENE-1799.patch, LUCENE-1799.patch, LUCENE-1799.patch, LUCENE-1799.patch, LUCENE-1799.patch, LUCENE-1799.patch, LUCENE-1799_big.patch


In LUCENE-1793, there is an off-topic suggestion to provide compression of Unicode data. The motivation was a custom encoding in a Russian analyzer. The original supposition was that it provided a more compact index.
This led to the comment that a different or compressed encoding would be a generally useful feature.
BOCU-1 was suggested as a possibility. This is a patented algorithm by IBM with an implementation in ICU. If Lucene provides its own implementation, a freely available, royalty-free license would need to be obtained.
SCSU is another Unicode compression algorithm that could be used.
An advantage of these methods is that they work on the whole of Unicode. If that is not needed, an encoding such as ISO-8859-1 (or whatever covers the input) could be used.
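
To make the size trade-off concrete, here is a minimal sketch that compares encoded byte counts using only the JDK's standard charsets. It is not Lucene code and is not taken from any of the attached patches; the class name EncodingSizeSketch and the method encodedLength are hypothetical names chosen for illustration. BOCU-1 and SCSU are not part of the JDK, so they are not measured here. The sketch only shows why a single-byte charset is smaller when it covers the input, and that it fails when it does not.

```java
import java.nio.charset.Charset;

// Hypothetical helper for illustration only; not part of Lucene or any patch on this issue.
final class EncodingSizeSketch {

    /**
     * Returns the number of bytes needed to encode the text in the named charset,
     * or -1 if the charset cannot represent every character of the text.
     */
    public static int encodedLength(String text, String charsetName) {
        Charset charset = Charset.forName(charsetName);
        if (!charset.newEncoder().canEncode(text)) {
            return -1; // the charset does not cover this input
        }
        return text.getBytes(charset).length;
    }

    public static void main(String[] args) {
        // Unicode escapes keep the source file independent of its own encoding.
        String latin = "caf\u00e9";                              // "cafe" with e-acute
        String russian = "\u043f\u0440\u0438\u0432\u0435\u0442"; // six Cyrillic letters

        String[] charsets = {"UTF-8", "UTF-16BE", "ISO-8859-1"};
        for (String name : charsets) {
            System.out.println(name
                    + ": latin=" + encodedLength(latin, name)
                    + " russian=" + encodedLength(russian, name));
        }
    }
}
```

In this sketch the Cyrillic sample takes two bytes per character in both UTF-8 and UTF-16 and cannot be encoded in ISO-8859-1 at all, which is the gap that a Unicode-wide scheme such as BOCU-1 or SCSU is meant to close.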
--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org


  • Earwin Burrfoot (JIRA) at Nov 15, 2010 at 9:30 pm
    [ https://issues.apache.org/jira/browse/LUCENE-1799?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12932206#action_12932206 ]

    Earwin Burrfoot commented on LUCENE-1799:
    -----------------------------------------

    Returning to this issue, right now the best place for this functionality seems to be a variant of CharTermAttribute?

Discussion Overview
group: dev @
categories: lucene
posted: Nov 15, '10 at 9:30p
active: Nov 15, '10 at 9:30p
posts: 2
users: 1
website: lucene.apache.org
