Robert Muir updated LUCENE-1799:
--------------------------------
Attachment: LUCENE-1799.patch
Attached is a simple prototype for encoding terms as BOCU-1.
While I don't think things like wildcard queries will work due to the nature of BOCU-1, term and phrase queries should work fine. It maintains UTF-8 order, so sorting is fine, and range queries should work once we fix TermRangeQuery to compare bytes.
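To illustrate why the range query comparison matters, here is a minimal sketch (not part of the patch) showing that Java's UTF-16 String order and unsigned byte order can disagree. It uses UTF-8 as a stand-in, since BOCU-1 needs the ICU charset jar; BOCU-1 is designed to keep the same code point order. The class and method names (TermOrderSketch, compareBytes, compareUtf8) are made up for this example.

```java
import java.nio.charset.StandardCharsets;

public class TermOrderSketch {
    // Lexicographic comparison that treats each byte as unsigned (0..255).
    public static int compareBytes(byte[] a, byte[] b) {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            int diff = (a[i] & 0xFF) - (b[i] & 0xFF);
            if (diff != 0) {
                return diff;
            }
        }
        // All shared bytes are equal: the shorter array sorts first.
        return a.length - b.length;
    }

    // Hypothetical helper: compare two terms by their UTF-8 encoded bytes.
    public static int compareUtf8(String x, String y) {
        return compareBytes(x.getBytes(StandardCharsets.UTF_8),
                            y.getBytes(StandardCharsets.UTF_8));
    }

    public static void main(String[] args) {
        String bmp = "\uFF5E";          // U+FF5E, one UTF-16 code unit
        String supp = "\uD834\uDD1E";   // U+1D11E, a surrogate pair in UTF-16

        // String.compareTo compares UTF-16 code units, so the surrogate
        // pair (0xD834...) sorts before 0xFF5E.
        System.out.println("UTF-16 order: " + Integer.signum(bmp.compareTo(supp)));

        // Byte comparison follows code point order: U+FF5E sorts before U+1D11E.
        System.out.println("byte order:   " + Integer.signum(compareUtf8(bmp, supp)));
    }
}
```

A range query that compares chars while the index is sorted by bytes could therefore pick the wrong bounds for terms containing supplementary characters.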
The impl is probably a bit slow (it uses the charset API), as it's just for playing around.
Note: I didn't check the box because of the patent thing (not sure it even applies since I use the ICU impl here), but either way I don't want to involve myself with that.
Unicode compression
-------------------
Key: LUCENE-1799
URL: https://issues.apache.org/jira/browse/LUCENE-1799
Project: Lucene - Java
Issue Type: New Feature
Components: Store
Affects Versions: 2.4.1
Reporter: DM Smith
Priority: Minor
Attachments: LUCENE-1799.patch
In LUCENE-1793, there is an off-topic suggestion to provide compression of Unicode data. The motivation was a custom encoding in a Russian analyzer. The original supposition was that it provided a more compact index.
This led to the comment that a different or compressed encoding would be a generally useful feature.
BOCU-1 was suggested as a possibility. This is a patented algorithm by IBM with an implementation in ICU. If Lucene provides its own implementation, a freely available, royalty-free license would need to be obtained.
SCSU is another Unicode compression algorithm that could be used.
An advantage of these methods is that they work on the whole of Unicode. If that is not needed, an encoding such as ISO-8859-1 (or whatever covers the input) could be used.
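As a rough illustration of the single-byte option, the sketch below compares encoded sizes using only the JDK's built-in charsets and checks whether a given charset covers the input at all. It does not use BOCU-1 or SCSU, and the names EncodingSizeSketch, encodedLength and fits are invented for this example.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class EncodingSizeSketch {
    // Number of bytes the text occupies in the given charset.
    public static int encodedLength(String text, Charset cs) {
        return text.getBytes(cs).length;
    }

    // True if every character of the text can be represented in the charset.
    public static boolean fits(String text, Charset cs) {
        return cs.newEncoder().canEncode(text);
    }

    public static void main(String[] args) {
        String latin = "caf\u00E9";                                 // Latin-1 text
        String cyrillic = "\u043F\u0440\u0438\u0432\u0435\u0442";   // Russian text

        System.out.println("Latin text, UTF-8 bytes:      "
                + encodedLength(latin, StandardCharsets.UTF_8));
        System.out.println("Latin text, ISO-8859-1 bytes: "
                + encodedLength(latin, StandardCharsets.ISO_8859_1));

        // A single-byte charset only helps when it covers the input.
        System.out.println("Cyrillic fits ISO-8859-1:     "
                + fits(cyrillic, StandardCharsets.ISO_8859_1));
        System.out.println("Cyrillic text, UTF-8 bytes:   "
                + encodedLength(cyrillic, StandardCharsets.UTF_8));
    }
}
```

Checking coverage first matters because String.getBytes silently substitutes a replacement byte for characters the charset cannot encode.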