Hi
I want to index data with UTF-8 encoding, so when adding a field to a document
I am using the code new String(value.getBytes("utf-8")).
On the other hand, when searching I was using the same snippet to convert
the query to UTF-8, but it did not work. Eventually I found a suggestion
to use new String(valueToSearch.getBytes("cp1252"),"UTF8"),
and it worked fine, but I still have some problems.
First, some characters are weird when I get results from Lucene; they seem
to be in cp1252 encoding.
Second, if the Java system property "file.encoding" is not cp1252, the
result is in a completely incorrect encoding, so I must change this property
using System.setProperty("file.encoding","cp1252").

Does Lucene ignore my UTF-8 encoding and index the data using cp1252?
How can I correct the weird characters I receive when searching?

Thank you very much in advance.
--
Regards,
Mohammad

  • Chris Hostetter at Feb 14, 2007 at 8:50 am
    Internally Lucene deals with pure Java Strings; when writing those strings
    to and reading those strings back from disk, Lucene always uses the stock
    Java "modified UTF-8" format, regardless of what your file.encoding
    system property may be.
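
As a small illustration of the format mentioned above: the JDK exposes "modified UTF-8" through DataOutputStream.writeUTF. The sketch below uses that JDK method, not Lucene's own I/O code, and the class and method names (ModifiedUtf8Demo, modifiedUtf8) are made up for the example. It suggests two things: the bytes come from the String's characters and never consult file.encoding, and the result matches standard UTF-8 for ordinary text while differing for U+0000.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical demo class; it exercises the JDK only, not Lucene.
public class ModifiedUtf8Demo {

    // Encodes s with the JDK's modified UTF-8 and strips the 2-byte
    // length prefix that writeUTF places in front of the data.
    static byte[] modifiedUtf8(String s) {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buf)) {
            out.writeUTF(s);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        byte[] all = buf.toByteArray();
        return Arrays.copyOfRange(all, 2, all.length);
    }

    public static void main(String[] args) {
        String text = "caf\u00e9"; // "café", written with an escape to keep the source ASCII

        // For ordinary text, modified UTF-8 and standard UTF-8 produce the same bytes.
        System.out.println(Arrays.equals(
                modifiedUtf8(text), text.getBytes(StandardCharsets.UTF_8))); // true

        // The formats differ for U+0000: two bytes (C0 80) versus one byte (00).
        System.out.println(modifiedUtf8("\u0000").length);                        // 2
        System.out.println("\u0000".getBytes(StandardCharsets.UTF_8).length);     // 1
    }
}
```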

    Typically when people have encoding problems in their Lucene applications,
    the origin of the problem is in the way they fetch the data before
    indexing it ... if you can make a String object, and System.out.println
    that string and see what you expect, then handing that string to Lucene as
    a field value should work fine.

    What exactly is the "value" object you are calling getBytes on? ... If
    it's another String, then you've already got serious problems -- I can't
    imagine any situation where fetching the bytes from a String in one
    charset and using those bytes to construct another string (either in a
    different charset, or in the system default charset) would make any sense
    at all.
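
To make the point above concrete, here is a minimal sketch of what a String-to-bytes-to-String round trip does when the two charsets differ. The class and method names (RoundTripDemo, garble, ungarble) are invented for the example, and it assumes the JVM ships the windows-1252 charset, which is not among the charsets every JVM is required to support.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class for the String -> bytes -> String round trip.
public class RoundTripDemo {

    // Assumption: windows-1252 (cp1252) is available in this JVM.
    static final Charset CP1252 = Charset.forName("windows-1252");

    // Encodes with UTF-8 but decodes with cp1252: each UTF-8 byte
    // becomes its own character, which corrupts non-ASCII text.
    static String garble(String s) {
        return new String(s.getBytes(StandardCharsets.UTF_8), CP1252);
    }

    // The trick from the question: it only "works" on text that was
    // already garbled in exactly the way shown by garble().
    static String ungarble(String s) {
        return new String(s.getBytes(CP1252), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String original = "\u00e9";            // "é"
        String broken = garble(original);      // two characters: U+00C3 U+00A9
        System.out.println(broken.length());   // 2
        System.out.println(ungarble(broken).equals(original)); // true
    }
}
```

If ungarble appears to fix a value, the value was most likely decoded with the wrong charset somewhere upstream, and that is the place to repair. The trick is also not guaranteed to be reversible for every character, since cp1252 leaves a few byte values unassigned.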

    Wherever your original binary data is coming from (files on disk, network
    socket, etc.), that's when you should be converting those bytes into
    chars using whatever charset you know those bytes represent.
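
A minimal sketch of that advice, assuming the source bytes are known to be UTF-8: name the charset explicitly at the point where bytes become chars, so the platform default (file.encoding) never gets involved. The class and method names (DecodeAtSource, readFirstLine) are hypothetical, and a ByteArrayInputStream stands in for the real file or socket.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class: decode once, at the boundary, with a known charset.
public class DecodeAtSource {

    // Reads the first line of a byte stream using the given charset
    // instead of relying on the platform default.
    static String readFirstLine(InputStream in, Charset charset) {
        try (BufferedReader reader =
                     new BufferedReader(new InputStreamReader(in, charset))) {
            return reader.readLine();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // The UTF-8 bytes of a four-letter Persian word, as they might sit in a file.
        byte[] onDisk = {(byte) 0xD8, (byte) 0xB3, (byte) 0xD9, (byte) 0x84,
                         (byte) 0xD8, (byte) 0xA7, (byte) 0xD9, (byte) 0x85};

        String value = readFirstLine(new ByteArrayInputStream(onDisk),
                                     StandardCharsets.UTF_8);

        // The String now holds four characters and can be handed on as-is,
        // for example as a field value; no further getBytes call is needed.
        System.out.println(value.length()); // 4
    }
}
```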



    : Date: Wed, 14 Feb 2007 09:16:58 +0330
    : From: Mohammad Norouzi <mnrz57@gmail.com>
    : Reply-To: java-user@lucene.apache.org
    : To: java-user@lucene.apache.org
    : Subject: encoding question.

    -Hoss


    ---------------------------------------------------------------------
    To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
    For additional commands, e-mail: java-user-help@lucene.apache.org
  • Benson Margulies at Feb 14, 2007 at 2:02 pm
    The usual source of this problem is HTML forms. If you want to get UTF-8
    back from a form, you have to send the form itself to the browser in
    UTF-8.
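
The form scenario can be simulated with the JDK alone. In this sketch URLEncoder plays the browser submitting a percent-encoded value and URLDecoder plays the server; the class and method names (FormDecodingDemo, submitAndDecode) are hypothetical, and a real servlet container has its own setting for the request charset, which is not shown here.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;
import java.net.URLEncoder;

// Hypothetical demo class: a form value encoded by one side and decoded by the other.
public class FormDecodingDemo {

    // Percent-encodes value as the "browser" would in browserCharset,
    // then decodes it as the "server" would in serverCharset.
    static String submitAndDecode(String value, String browserCharset, String serverCharset) {
        try {
            String onTheWire = URLEncoder.encode(value, browserCharset);
            return URLDecoder.decode(onTheWire, serverCharset);
        } catch (UnsupportedEncodingException e) {
            throw new IllegalArgumentException("unknown charset name", e);
        }
    }

    public static void main(String[] args) {
        String typed = "\u00e9"; // "é"

        // Both sides agree on UTF-8: the value survives.
        System.out.println(submitAndDecode(typed, "UTF-8", "UTF-8").equals(typed)); // true

        // The server assumes cp1252: the value arrives as two wrong characters.
        System.out.println(submitAndDecode(typed, "UTF-8", "windows-1252").length()); // 2
    }
}
```

A mismatch of this kind would explain why a getBytes("cp1252") repair seemed to help on the search side: it undoes a wrong decode that happened before the application ever saw the value.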

  • Peter Keegan at Jul 19, 2007 at 6:53 pm
    The source data for my index is already in standard UTF-8 and available as a
    simple byte array. I need to do some simple tokenization of the data (check
    for whitespace and special characters that control position increment). What
    is the most efficient way to index this data and avoid unnecessary
    conversions to/from Java Strings or char arrays? Looking at DocumentsWriter,
    I see that all terms are eventually converted to char arrays and written in
    modified-UTF-8, so there doesn't seem to be much advantage to having the
    source data in standard UTF-8.

    Peter
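
One straightforward option for the byte-array case, though not necessarily the fastest, is to decode the whole array once with a CharsetDecoder and tokenize the resulting CharBuffer, so that Strings are created only for the tokens themselves. This is a sketch under assumptions: the names (Utf8BytesTokenizer, tokenize) are hypothetical, only whitespace splitting is shown, and wiring the tokens into a Lucene analyzer or handling position increments is left out.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: one decode pass, then whitespace tokenization.
public class Utf8BytesTokenizer {

    static List<String> tokenize(byte[] utf8) {
        CharBuffer chars;
        try {
            // Decode the whole array in one pass; fail loudly on invalid input
            // instead of silently substituting replacement characters.
            chars = StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(utf8));
        } catch (CharacterCodingException e) {
            throw new IllegalArgumentException("input is not valid UTF-8", e);
        }

        List<String> tokens = new ArrayList<>();
        int start = -1; // index where the current token began, or -1 if between tokens
        for (int i = 0; i < chars.length(); i++) {
            boolean whitespace = Character.isWhitespace(chars.charAt(i));
            if (!whitespace && start < 0) {
                start = i;
            } else if (whitespace && start >= 0) {
                tokens.add(chars.subSequence(start, i).toString());
                start = -1;
            }
        }
        if (start >= 0) {
            tokens.add(chars.subSequence(start, chars.length()).toString());
        }
        return tokens;
    }

    public static void main(String[] args) {
        byte[] source = "foo  b\u00e9r\tbaz".getBytes(StandardCharsets.UTF_8);
        System.out.println(tokenize(source).size()); // 3
    }
}
```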

Discussion Overview
group: java-user
category: lucene
posted: Feb 14, '07 at 5:47a
active: Jul 19, '07 at 6:53p
posts: 4
users: 4
website: lucene.apache.org
