FAQ
I have an index in Spanish and I use Snowball to stem and analyze and it
works perfectly. However, I am running into trouble storing (not indexing,
only storing) words that have special characters.

That is, I store the special character, but then it comes back garbled when I
read it.
To provide an example:

String content = "niños";
document.add(new Field("name",content,Store.YES, Index.Tokenized));
writer.addDocument(document, new SnowballAnalyzer("Spanish"));
When I read the field back
String nombre = doc.get("name");

Then nombre will contain "ni�os"

Looking at the index with Luke it shows me "ni�os" but when I want to
see the full text (by right clicking) it shows me ni�os.

I know Lucene is supposed to store fields in UTF-8, but then, how can I make
sure I store something and get it back just as it was, including special
characters?

Thanks
--
Juan Pablo Morales
Ingenian Software ltda
Bogotá, Colombia


  • Steven A Rowe at Aug 21, 2008 at 5:47 pm
    Hola Juan,
    On 08/21/2008 at 1:16 PM, Juan Pablo Morales wrote:
    I have an index in Spanish and I use Snowball to stem and
    analyze and it works perfectly. However, I am running into
    trouble storing (not indexing, only storing) words that
    have special characters.

    That is, I store the special character, but then it comes back
    garbled when I read it.
    To provide an example:

    String content = "niños";
    document.add(new Field("name",content,Store.YES, Index.Tokenized));
    writer.addDocument(doc, new SnowballAnalyzer("Spanish"));
    If your source code is encoded as Latin-1, then it will likely appear to you to be the correct character (depending on the editor/viewer you're using and its configuration), but Java may not properly convert it to Unicode, depending on the encoding it expects your source code to be in (see the -encoding option to javac - if you don't specify it, then the platform default encoding is used). You could test whether this is the problem by instead trying:

    String content = "ni\u00F1os";
    ...
    Looking at the index with Luke it shows me "ni�os" but
    when I want to see the full text (by right clicking) it shows
    me ni�os.
    � is the Unicode replacement character (U+FFFD), and it's routinely used, including within Lucene itself, as the substitute character for byte sequences that are not valid in the designated source encoding.
    I know Lucene is supposed to store fields in UTF-8, but then,
    how can I make sure I store something and get it back just as
    it was, including special characters?
    Make sure that the data you give to Lucene is encoded properly, and then what you get back should also be.

    Please try the suggestion I gave you above ("ni\u00F1os"). If you still have the same problem, you may have found a bug - please report back what you find.

    Steve
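
    The mismatch Steve describes can be reproduced with nothing but the JDK.
    Here is a minimal sketch (not from the thread, and no Lucene involved) of
    what happens when Latin-1 bytes get decoded as UTF-8:

```java
import java.nio.charset.StandardCharsets;

public class EncodingMismatchDemo {
    public static void main(String[] args) {
        // "niños", written with a Unicode escape so source-file encoding cannot garble it
        String original = "ni\u00F1os";

        // Encode as Latin-1: the ñ becomes the single byte 0xF1
        byte[] latin1 = original.getBytes(StandardCharsets.ISO_8859_1);

        // Decode those bytes as UTF-8: 0xF1 does not start a valid UTF-8
        // sequence here, so the decoder substitutes U+FFFD, the replacement
        // character -- exactly the � seen in Luke
        String garbled = new String(latin1, StandardCharsets.UTF_8);

        System.out.println(garbled);                    // ni�os
        System.out.println(garbled.contains("\uFFFD")); // true
    }
}
```

    The round trip only survives when both sides use the same charset, which
    is why controlling the encoding end to end (including javac -encoding)
    matters.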
  • Juan Pablo Morales at Aug 21, 2008 at 5:58 pm

    On Thu, Aug 21, 2008 at 12:47 PM, Steven A Rowe wrote:

    Hola Juan,
    Hi Steve
    Please try the suggestion I gave you above ("ni\u00F1os"). If you still
    have the same problem, you may have found a bug - please report back what
    you find.
    I just gave my example that string, and it correctly wrote it on the screen
    as an ñ, but storing and retrieving it yielded the same result; that is,
    the character gets lost in translation.

    --
    Juan Pablo Morales
    Ingenian Software ltda
  • Grant Ingersoll at Aug 21, 2008 at 10:31 pm
    Here's a unit test:

    import junit.framework.TestCase;
    import org.apache.lucene.analysis.snowball.SnowballAnalyzer;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.queryParser.QueryParser;
    import org.apache.lucene.search.Hits;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.Query;
    import org.apache.lucene.store.RAMDirectory;

    public class SpanishTest extends TestCase {

        public void testSpanish() throws Exception {
            RAMDirectory directory = new RAMDirectory();
            String content = "niños";
            IndexWriter writer = new IndexWriter(directory, new StandardAnalyzer(), true);
            Document document = new Document();
            document.add(new Field("name", content, Field.Store.YES, Field.Index.TOKENIZED));
            SnowballAnalyzer snowballAnalyzer = new SnowballAnalyzer("Spanish");
            writer.addDocument(document, snowballAnalyzer);
            writer.close();

            IndexSearcher searcher = new IndexSearcher(directory);
            QueryParser parser = new QueryParser("name", snowballAnalyzer);
            Query query = parser.parse(content);
            System.out.println("Query: " + query);
            Hits hits = searcher.search(query);
            assertTrue("hits Size: " + hits.length() + " is not: " + 1, hits.length() == 1);
            Document theDoc = hits.doc(0);
            String nombre = theDoc.get("name");
            System.out.println("Nombre: " + nombre);
        }
    }


    When I run this in IntelliJ, I get:

    Query: name:niñ
    Nombre: niños

    Process finished with exit code 0


    Are you by chance indexing XML?


    --------------------------
    Grant Ingersoll
    http://www.lucidimagination.com

    Lucene Helpful Hints:
    http://wiki.apache.org/lucene-java/BasicsOfPerformance
    http://wiki.apache.org/lucene-java/LuceneFAQ








  • Juan Pablo Morales at Aug 21, 2008 at 11:19 pm
    You are right, it does work. I'll look into my example to see where the
    difference is.
    On Thu, Aug 21, 2008 at 5:30 PM, Grant Ingersoll wrote:

    Are you by chance indexing XML?
    Indirectly, yes
    --
    Juan Pablo Morales
    Ingenian Software ltda
  • Juan Pablo Morales at Aug 22, 2008 at 12:16 am
    It was, after all, an XML issue: the servlets creating the content that was
    being indexed were not sending UTF-8, but the XML declaration stated the
    encoding WAS UTF-8, so it really was not a Lucene issue after all. Thanks
    for all the help.
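
    Juan's diagnosis (declared encoding vs. actual bytes) can be sketched
    outside a servlet container. The snippet below is an editorial
    illustration, not code from the thread; it assumes the producing side
    wrote Latin-1 bytes while the XML declaration promised UTF-8:

```java
import java.nio.charset.StandardCharsets;

public class XmlDeclarationMismatch {
    public static void main(String[] args) {
        // The declaration promises UTF-8 ...
        String xml = "<?xml version=\"1.0\" encoding=\"UTF-8\"?><name>ni\u00F1os</name>";

        // ... but the producing servlet actually sends Latin-1 bytes
        byte[] wire = xml.getBytes(StandardCharsets.ISO_8859_1);

        // A consumer that trusts the declaration decodes them as UTF-8,
        // so the ñ is already U+FFFD before Lucene ever stores anything
        String received = new String(wire, StandardCharsets.UTF_8);
        System.out.println(received);
    }
}
```

    The fix belongs on the producing side: make the servlet's actual output
    encoding match the declaration, e.g. a text/xml; charset=UTF-8 content
    type backed by a writer that really emits UTF-8.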


    --
    Juan Pablo Morales
    Ingenian Software ltda

Discussion Overview
group: java-user @ lucene.apache.org
categories: lucene
posted: Aug 21, '08 at 5:17p
active: Aug 22, '08 at 12:16a
posts: 6
users: 3
