FAQ
Dear list,

a very basic question about Lucene: which version of
Unicode can be handled (indexed and searched) with Lucene?

It looks like Lucene can only handle the very old Unicode 2.0
but not the newer 3.1 version (4-byte UTF-8 Unicode).

Is that true?

Regards,
Bernd

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org

  • Simon Willnauer at Feb 25, 2011 at 11:05 am
    Hey Bernd,

    On Fri, Feb 25, 2011 at 11:23 AM, Bernd Fehling wrote:
    Dear list,

    a very basic question about lucene, which version of
    unicode can be handled (indexed and searched) with lucene?

    If you are asking what the indexer / query side can handle, then it is
    really whatever UTF-8 can handle. Strings passed to the writer / reader
    are converted to UTF-8 internally (rough picture). On trunk we are
    indexing bytes only (UTF-8 bytes by default). So the question is
    really what your platform supports in terms of utilities / operations
    on characters and strings. Since Lucene 3.0 we are on Java 1.5 and
    have the possibility to respect code points above the BMP.
    Lucene 2.9 still has Java 1.4 system requirements, which prevented us
    from moving forward to Unicode 4.0. If you look at Character.java, the
    methods in Java 1.5 have gained variants that operate on UTF-32 code
    points (int) in addition to the UTF-16 code units (char) of Java 1.4.

    Since 3.0 is purely a Java generics / move-to-Java-1.5 release, these
    APIs are not in use yet in the latest released version. Lucene 3.1
    holds a largely converted Analyzer / TokenFilter / Tokenizer codebase
    (I think there are one or two which still have problems, I should
    check... Robert, did we fix all the NGram stuff?).

    So, long story short: the 3.0 / 2.9 Tokenizer and TokenFilter
    implementations only support characters within the BMP (<= 0xFFFF).
    3.1 (to be released soon, I hope) will fix most of the problems and
    includes ICU-based analysis for full Unicode 5 support.

    hope that helps

    simon
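
A minimal sketch of the difference Simon describes between working on UTF-16 code units (char) and on whole code points (int), using only java.lang APIs available since Java 1.5. The class and method names are made up for illustration and are not Lucene code; the letter filter merely stands in for what a tokenizer might do.

```java
final class CodePointDemo {
    // Keep only letters, testing one UTF-16 code unit (char) at a time,
    // which is all that Java 1.4-era code could do.
    static String lettersByChar(String text) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (Character.isLetter(c)) {
                out.append(c);
            }
        }
        return out.toString();
    }

    // Keep only letters, testing whole code points (int overloads, Java 1.5+).
    static String lettersByCodePoint(String text) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            if (Character.isLetter(cp)) {
                out.appendCodePoint(cp);
            }
            i += Character.charCount(cp); // 1 for BMP, 2 for supplementary
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // U+1D5A0 MATHEMATICAL SANS-SERIF CAPITAL A, a letter outside the BMP.
        String text = "a" + new String(Character.toChars(0x1D5A0)) + "1";
        System.out.println(text.length());                         // 4 code units
        System.out.println(text.codePointCount(0, text.length())); // 3 code points
        System.out.println(lettersByChar(text).length());          // 1: only "a" survives
        System.out.println(lettersByCodePoint(text).length());     // 3: "a" plus the surrogate pair
    }
}
```

With the char-based loop, the two surrogate halves of U+1D5A0 are each rejected as non-letters, so the character silently disappears; the code-point loop keeps it.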
  • Bernd Fehling at Feb 25, 2011 at 12:03 pm
    Hi Simon,

    thanks for the details.

    My platform supports and uses code points above the BMP (0x10000 and up).
    So the limit is Lucene.
    I don't know how to handle this problem.
    Maybe by deleting all code points above the BMP?

    Good to hear that Lucene 3.1 will come soon.
    Any rough estimation when Lucene 3.1 will be available?

    Regards,
    Bernd

  • Simon Willnauer at Feb 25, 2011 at 12:44 pm

    On Fri, Feb 25, 2011 at 1:02 PM, Bernd Fehling wrote:
    Hi Simon,

    thanks for the details.

    My platform supports and uses code above BMP (0x10000 and up).
    So the limit is Lucene.
    Don't know how to handle this problem.
    May be deleting all code above BMP...???

    The code will work fine even if such characters are in your text. It will
    just not respect them, and may throw them away during tokenization etc., so
    it really depends on what you are using on the analyzer side. Maybe you can
    give us a little more detail on what you use for analysis. One option
    would be to build 3.1 from source and use the analyzers from
    there.

    Good to hear that Lucene 3.1 will come soon.
    Any rough estimation when Lucene 3.1 will be available?

    I hope it will happen within the next 4 weeks.

    simon
  • Bernd Fehling at Feb 25, 2011 at 1:19 pm
    Hi Simon,

    actually I'm working with Solr from trunk but followed the problem
    all the way down to Lucene. I think Solr trunk is built with Lucene 3.0.3.

    My field is:
    <field name="dcdescription" type="string" indexed="false" stored="true" />

    No analysis is done at all; the content is just stored for result display.
    But the result is unpredictable and can end up as invalid UTF-8.

    Regards,
    Bernd


    --
    *************************************************************
    Bernd Fehling Universitätsbibliothek Bielefeld
    Dipl.-Inform. (FH) Universitätsstr. 25
    Tel. +49 521 106-4060 Fax. +49 521 106-4052
    bernd.fehling@uni-bielefeld.de 33615 Bielefeld

    BASE - Bielefeld Academic Search Engine - www.base-search.net
    *************************************************************
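
One way to check whether bytes coming back from a stored field are well-formed UTF-8 is a strict decode, sketched below with java.nio.charset only. The class name Utf8Check and the sample data are assumptions for illustration; this does not touch Solr or Lucene, and it does not explain where the invalid bytes in Bernd's case came from.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

final class Utf8Check {
    // Decode strictly: malformed input raises an exception instead of being
    // silently replaced with U+FFFD.
    static String decodeStrict(byte[] bytes) throws CharacterCodingException {
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        return decoder.decode(ByteBuffer.wrap(bytes)).toString();
    }

    static boolean isValidUtf8(byte[] bytes) {
        try {
            decodeStrict(bytes);
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // 'A' followed by U+1D5A0, which needs four bytes in UTF-8.
        String original = "A" + new String(Character.toChars(0x1D5A0));
        byte[] encoded = original.getBytes(StandardCharsets.UTF_8);
        System.out.println(encoded.length);       // 5 = 1 byte + 4 bytes
        System.out.println(isValidUtf8(encoded)); // true

        // Dropping the last byte leaves a truncated 4-byte sequence.
        byte[] truncated = Arrays.copyOf(encoded, encoded.length - 1);
        System.out.println(isValidUtf8(truncated)); // false
    }
}
```

Running such a check on the raw response bytes, before any XML or JSON parser sees them, can help narrow down which layer produced the bad sequence.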

  • Uwe Schindler at Feb 25, 2011 at 1:40 pm
    Solr trunk is using Lucene trunk since Lucene and Solr are merged.

    -----
    Uwe Schindler
    H.-H.-Meier-Allee 63, D-28213 Bremen
    http://www.thetaphi.de
    eMail: uwe@thetaphi.de
  • Bernd Fehling at Feb 25, 2011 at 1:49 pm
    So Solr trunk should already handle Unicode above BMP for field type string?
    Strange...

    Regards,
    Bernd

  • Uwe Schindler at Feb 25, 2011 at 1:53 pm
    What APIs are you using to communicate with Solr? If you are using XML, it may be limited by the XML parser used... If you are using SolrJ with the binary request handler, it should go through in all cases.

    -----
    Uwe Schindler
    H.-H.-Meier-Allee 63, D-28213 Bremen
    http://www.thetaphi.de
    eMail: uwe@thetaphi.de

  • Yonik Seeley at Feb 25, 2011 at 1:55 pm

    On Fri, Feb 25, 2011 at 8:48 AM, Bernd Fehling wrote:
    So Solr trunk should already handle Unicode above BMP for field type string?
    Strange...

    One issue is that Jetty doesn't support UTF-8 beyond the BMP:

    /opt/code/lusolr/solr/example/exampledocs$ ./test_utf8.sh
    Solr server is up.
    HTTP GET is accepting UTF-8
    HTTP POST is accepting UTF-8
    HTTP POST defaults to UTF-8
    ERROR: HTTP GET is not accepting UTF-8 beyond the basic multilingual plane
    ERROR: HTTP POST is not accepting UTF-8 beyond the basic multilingual plane
    ERROR: HTTP POST + URL params is not accepting UTF-8 beyond the basic multilingual plane

    -Yonik
    http://lucidimagination.com

  • Bernd Fehling at Feb 25, 2011 at 2:10 pm
    Hi Yonik,

    good point, yes we are using Jetty.
    Do you know if Tomcat has this limitation?

    Regards,
    Bernd

  • Yonik Seeley at Feb 25, 2011 at 2:17 pm

    On Fri, Feb 25, 2011 at 9:09 AM, Bernd Fehling wrote:
    Hi Yonik,

    good point, yes we are using Jetty.
    Do you know if Tomcat has this limitation?

    Tomcat's defaults are worse: you need to configure it to use UTF-8 by
    default for URLs.
    Once you do, it passes all those tests (last I checked). Those tests
    are really about UTF-8 working in GET/POST query arguments. Solr may
    still be able to handle indexing and returning full UTF-8, but you
    wouldn't be able to query for it without using surrogates if you're
    using Jetty.

    It would be good to test though - does anyone know how to add a char
    above the BMP to utf8-example.xml?

    -Yonik
    http://lucidimagination.com
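
To make the GET/POST query argument issue concrete, here is a small sketch of how a supplementary character is percent-encoded in a URL parameter. It uses only java.net.URLEncoder and java.net.URLDecoder; the class name QueryParamDemo is hypothetical, and nothing here reproduces Jetty's or Tomcat's actual parsing.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;
import java.net.URLEncoder;

final class QueryParamDemo {
    // Percent-encode a parameter value as UTF-8.
    static String encode(String value) {
        try {
            return URLEncoder.encode(value, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException("UTF-8 is always available", e);
        }
    }

    // Decode a percent-encoded parameter value as UTF-8.
    static String decode(String value) {
        try {
            return URLDecoder.decode(value, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException("UTF-8 is always available", e);
        }
    }

    public static void main(String[] args) {
        // U+29B05, a CJK ideograph outside the BMP (two chars in a Java String).
        String value = new String(Character.toChars(0x29B05));
        String encoded = encode(value);
        System.out.println(encoded); // %F0%A9%AC%85 (one 4-byte UTF-8 sequence)
        // A correct decoder must return the same single code point.
        System.out.println(Integer.toHexString(decode(encoded).codePointAt(0))); // 29b05
    }
}
```

A container that decodes those four bytes correctly hands the application one code point (two Java chars); the failing test_utf8.sh checks quoted in this thread suggest that Jetty did not do so at the time.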

  • Robert Muir at Feb 25, 2011 at 2:32 pm

    On Fri, Feb 25, 2011 at 9:16 AM, Yonik Seeley wrote:

    It would be good to test though - does anyone know how to add a char
    above the BMP to utf8-example.xml?

    I tried the following, then tried to search on this character (U+29B05
    / UTF-8: [f0 a9 ac 85]) with Jetty and got no results.
    I also went to analysis.jsp as a quick test, and noted that Jetty
    treats it as if it were U+9B05 / UTF-8: [e9 ac 85].

    Then I searched on 'range' via the admin GUI to retrieve this
    document, and Chrome blew up with "This page contains the following
    errors: error on line 17 at column 306: Encoding error".

    I didn't try Tomcat.

    Index: utf8-example.xml
    ===================================================================
    --- utf8-example.xml (revision 1074125)
    +++ utf8-example.xml (working copy)
    @@ -34,6 +34,7 @@
    <field name="features">eaiou with umlauts: ëäïöü</field>
    <field name="features">tag with escaped chars: &lt;nicetag/&gt;</field>
    <field name="features">escaped ampersand: Bonnie &amp; Clyde</field>
    +    <field name="features">full unicode range (supplementary char): 𩬅</field>
    <field name="price">0</field>
    <!-- no popularity, get the default from schema.xml -->
    <field name="inStock">true</field>
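
The two byte sequences Robert reports are consistent with the code point being cut down to its low 16 bits somewhere along the way. The sketch below only demonstrates that arithmetic with standard Java APIs; whether this is what Jetty actually did internally is an assumption, and the class name TruncationDemo is hypothetical.

```java
import java.nio.charset.StandardCharsets;

final class TruncationDemo {
    // Return the UTF-8 encoding of a single code point as space-separated hex.
    static String utf8Hex(int codePoint) {
        byte[] bytes = new String(Character.toChars(codePoint))
                .getBytes(StandardCharsets.UTF_8);
        StringBuilder hex = new StringBuilder();
        for (byte b : bytes) {
            if (hex.length() > 0) {
                hex.append(' ');
            }
            hex.append(String.format("%02x", b & 0xFF));
        }
        return hex.toString();
    }

    public static void main(String[] args) {
        int full = 0x29B05;
        int truncated = full & 0xFFFF; // keep only the low 16 bits
        System.out.println(utf8Hex(full));                  // f0 a9 ac 85
        System.out.println(Integer.toHexString(truncated)); // 9b05
        System.out.println(utf8Hex(truncated));             // e9 ac 85
    }
}
```

The last two bytes (ac 85) are identical in both encodings because they carry the same low 12 bits of the code point; only the leading bytes differ.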

  • Yonik Seeley at Feb 25, 2011 at 3:05 pm

    On Fri, Feb 25, 2011 at 9:31 AM, Robert Muir wrote:
    Then i searched on 'range' via the admin gui to retrieve this
    document, and chrome blew up with "This page contains the following
    errors: error on line 17 at column 306: Encoding error"

    I got an error in Firefox too.
    I added the following example (commented out for now):
    <field name="features">Outside the BMP:𐌈 codepoint=10308, a circle with an x inside. UTF8=f0908c88 UTF16=d800 df08</field>

    I can verify it got into Solr OK by querying with the python format (which
    escapes everything outside the ASCII range, one 16-bit char at a time):
    http://localhost:8983/solr/select?q=BMP&wt=python&indent=true

    [...]
    u'Outside the BMP:\ud800\udf08 codepoint=10308, a circle with an x inside. UTF8=f0908c88 UTF16=d800 df08']}]

    But Firefox complains about the XML output, and in any other output
    format, such as JSON, it looks mangled.
    My bet is that Jetty's UTF-8 encoding for the response also doesn't
    handle the full range.

    -Yonik
    http://lucidimagination.com

  • Robert Muir at Feb 25, 2011 at 4:29 pm

    On Fri, Feb 25, 2011 at 10:04 AM, Yonik Seeley wrote:
    But Firefox complains about the XML output, and in any other output
    format, such as JSON, it looks mangled.
    My bet is that Jetty's UTF-8 encoding of the response also doesn't
    handle the full range.

    I created a JIRA issue on Jetty's issue tracker with a tentative fix:
    http://jira.codehaus.org/browse/JETTY-1340

    Our test_utf8.sh passes with this.

  • Bernd Fehling at Feb 25, 2011 at 2:47 pm
    I just tried vim as the editor, and it seems to work.

    - start vim
    - enter i (for insert)
    - enter <ctrl>+v and then <shift>+U (for uppercase U)
    - enter the code point as 8 hex digits
    (e.g. 0001D5A0 for U+1D5A0 [MATHEMATICAL SANS-SERIF CAPITAL A])
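
    The 8-digit form typed after <ctrl>+v U is simply the code point
    zero-padded to eight hex digits. Below is a small Python sketch (the
    helper names are hypothetical) that produces it, together with the
    equivalent XML numeric character reference. The reference is another
    way to put such a character into an XML file without typing it
    directly.

```python
import unicodedata


def vim_digits(codepoint: int) -> str:
    """Eight uppercase hex digits, as typed after <ctrl>+v U in vim."""
    return format(codepoint, "08X")


def xml_char_ref(codepoint: int) -> str:
    """XML numeric character reference for the code point."""
    return "&#x{:X};".format(codepoint)


if __name__ == "__main__":
    cp = 0x1D5A0
    print(vim_digits(cp))                 # digits to type in vim
    print(xml_char_ref(cp))               # alternative for XML files
    print(unicodedata.name(chr(cp)))      # official character name
```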


    On 25.02.2011 at 15:16, Yonik Seeley wrote:
    On Fri, Feb 25, 2011 at 9:09 AM, Bernd Fehling
    wrote:
    Hi Yonik,

    good point, yes we are using Jetty.
    Do you know if Tomcat has this limitation?

    Tomcat's defaults are worse: you need to configure it to use UTF-8 by
    default for URLs.
    Once you do, it passes all those tests (last I checked). Those tests
    are really about UTF-8 working in GET/POST query arguments. Solr may
    still be able to handle indexing and returning full UTF-8, but you
    wouldn't be able to query for it without using surrogates if you're
    using Jetty.
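
    To make "UTF-8 in GET query arguments" concrete: a character above the
    BMP becomes four percent-encoded bytes in the URL, and the server has to
    decode all four as a single UTF-8 sequence. Here is a minimal Python
    sketch, assuming the example Solr URL used in this thread
    (http://localhost:8983/solr/select); the function name is hypothetical.

```python
from urllib.parse import quote, unquote, urlencode


def solr_query_url(term: str, base: str = "http://localhost:8983/solr/select") -> str:
    """Build a GET URL whose q parameter is percent-encoded UTF-8."""
    return base + "?" + urlencode({"q": term}, encoding="utf-8")


if __name__ == "__main__":
    ch = "\U00010308"            # a character outside the BMP
    print(quote(ch))             # the four UTF-8 bytes, percent-encoded
    print(solr_query_url(ch))    # what a client would send
    print(unquote(quote(ch)) == ch)  # a correct decoder round-trips it
```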

    It would be good to test though - does anyone know how to add a char
    above the BMP to utf8-example.xml?

    -Yonik
    http://lucidimagination.com

    Regards,
    Bernd

    On 25.02.2011 at 14:54, Yonik Seeley wrote:
    On Fri, Feb 25, 2011 at 8:48 AM, Bernd Fehling
    wrote:
    So Solr trunk should already handle Unicode above BMP for field type string?
    Strange...
    One issue is that Jetty doesn't support UTF-8 beyond the BMP:

    /opt/code/lusolr/solr/example/exampledocs$ ./test_utf8.sh
    Solr server is up.
    HTTP GET is accepting UTF-8
    HTTP POST is accepting UTF-8
    HTTP POST defaults to UTF-8
    ERROR: HTTP GET is not accepting UTF-8 beyond the basic multilingual plane
    ERROR: HTTP POST is not accepting UTF-8 beyond the basic multilingual plane
    ERROR: HTTP POST + URL params is not accepting UTF-8 beyond the basic
    multilingual plane

    -Yonik
    http://lucidimagination.com
  • Robert Muir at Feb 25, 2011 at 10:02 pm

    On Fri, Feb 25, 2011 at 9:09 AM, Bernd Fehling wrote:
    Hi Yonik,

    good point, yes we are using Jetty.
    Do you know if Tomcat has this limitation?

    Hi Bernd, I have placed some patched Jetty jar files on
    https://issues.apache.org/jira/browse/SOLR-2381 for the time being.

    Maybe then you can get past your problem with Jetty.

  • Bernd Fehling at Feb 27, 2011 at 2:04 pm
    Hi Robert,

    thanks to you and Yonik for looking into this.
    As soon as the Apache JIRA is back online I will try your Jetty version
    and give feedback.

    Regards,
    Bernd
  • Uwe Schindler at Feb 27, 2011 at 2:19 pm
    It's back online! It would be good if you could confirm the fix; we
    worked hard to fix this and to report the bugs in Jetty to the Jetty
    project itself.

    Thanks,
    Uwe!

    -----
    Uwe Schindler
    H.-H.-Meier-Allee 63, D-28213 Bremen
    http://www.thetaphi.de
    eMail: uwe@thetaphi.de
  • Bernd Fehling at Feb 27, 2011 at 7:16 pm
    Yep, it's back online.
    I just ran a short test and reported my results to JIRA, but is the
    error in the XML output still a Jetty problem or is it from XMLwriter?

    Regards, Bernd
  • Yonik Seeley at Feb 27, 2011 at 8:43 pm

    On Sun, Feb 27, 2011 at 2:15 PM, Bernd Fehling wrote:
    Yep, it's back online.
    I just ran a short test and reported my results to JIRA, but is the
    error in the XML output still a Jetty problem or is it from XMLwriter?

    The patch has been committed, so you should just be able to try trunk (or 3x).

    I also just committed a char beyond the BMP to utf8-example.xml
    and the indexing and XML output works fine for me.

    Index the example docs, then do a query for "BMP" to bring up that document.

    -Yonik
    http://lucidimagination.com

  • Robert Muir at Feb 25, 2011 at 1:07 pm

    On Fri, Feb 25, 2011 at 6:04 AM, Simon Willnauer wrote:

    Since 3.0 is a Java Generics / move to Java 1.5 only release these
    APIs are not in use yet in the latest released version. Lucene 3.1
    holds a largely converted Analyzer / TokenFilter / Tokenizer codebase
    (I think there are one or two which still have problems, I should
    check... Robert did we fix all NGram stuff?).

    No... and honestly they have other serious problems (such as only looking
    at the first 1024 chars of input in the document; see the JIRA issues). I
    recommend against using them in general, and definitely if you have
    code points outside the BMP.
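
    Why code points outside the BMP trip up code that works on 16-bit chars
    can be illustrated without Lucene. The following Python sketch is not
    Lucene's NGram implementation; it only contrasts bigrams built from
    UTF-16 code units (what a Java char holds) with bigrams built from whole
    code points. All function names are made up for the illustration.

```python
def utf16_units(text: str) -> list:
    """UTF-16 code units of text as integers (one entry per Java char)."""
    data = text.encode("utf-16-be")
    return [int.from_bytes(data[i:i + 2], "big") for i in range(0, len(data), 2)]


def ngrams(items: list, n: int) -> list:
    """All contiguous n-grams of a sequence, as tuples."""
    return [tuple(items[i:i + n]) for i in range(len(items) - n + 1)]


def splits_pair(gram: tuple) -> bool:
    """True if the gram starts with a low surrogate or ends with a high one."""
    starts_low = 0xDC00 <= gram[0] <= 0xDFFF
    ends_high = 0xD800 <= gram[-1] <= 0xDBFF
    return starts_low or ends_high


if __name__ == "__main__":
    text = "a\U00010308b"
    unit_grams = ngrams(utf16_units(text), 2)
    # Grams that cut a surrogate pair in half are not valid text on their own.
    print([g for g in unit_grams if splits_pair(g)])
    # Working on code points keeps the supplementary character intact.
    print(ngrams(list(text), 2))
```

    Iterating by code point instead of by 16-bit unit is the kind of change
    the thread describes for the converted analysis code.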

Discussion Overview
group: java-user
categories: lucene
posted: Feb 25, '11 at 10:24a
active: Feb 27, '11 at 8:43p
posts: 21
users: 5
website: lucene.apache.org
