Here's a unit test:
import junit.framework.TestCase;
import org.apache.lucene.analysis.snowball.SnowballAnalyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.Hits;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.store.RAMDirectory;
public class SpanishTest extends TestCase {
    public void testSpanish() throws Exception {
        RAMDirectory directory = new RAMDirectory();
        String content = "niños";
        IndexWriter writer = new IndexWriter(directory, new StandardAnalyzer(), true);
        Document document = new Document();
        document.add(new Field("name", content, Field.Store.YES, Field.Index.TOKENIZED));
        SnowballAnalyzer snowballAnalyzer = new SnowballAnalyzer("Spanish");
        writer.addDocument(document, snowballAnalyzer);
        writer.close();
        IndexSearcher searcher = new IndexSearcher(directory);
        QueryParser parser = new QueryParser("name", snowballAnalyzer);
        Query query = parser.parse(content);
        System.out.println("Query: " + query);
        Hits hits = searcher.search(query);
        assertTrue("hits Size: " + hits.length() + " is not: " + 1, hits.length() == 1);
        Document theDoc = hits.doc(0);
        String nombre = theDoc.get("name");
        System.out.println("Nombre: " + nombre);
    }
}
When I run this in IntelliJ, I get:
Query: name:niñ
Nombre: niños
Process finished with exit code 0
Are you by chance indexing XML?
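
In case it helps narrow things down, here is a minimal sketch of one way this kind of garbling can happen outside Lucene: the text is encoded as UTF-8 but decoded with a different charset somewhere along the way. This is only an assumption about the cause, not a diagnosis of your setup. The class and method names are hypothetical, it uses only the JDK (StandardCharsets needs Java 7 or later), and the string is written as "ni\u00f1os" so the result does not depend on the encoding of the source file.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper class, for illustration only; not part of Lucene.
public class EncodingRoundTrip {

    // Encode and decode with the same charset: the text survives intact.
    static String roundTripUtf8(String s) {
        byte[] bytes = s.getBytes(StandardCharsets.UTF_8);
        return new String(bytes, StandardCharsets.UTF_8);
    }

    // Encode as UTF-8 but decode as ISO-8859-1: the two bytes of the
    // non-ASCII character are each turned into a separate wrong character.
    static String decodeUtf8AsLatin1(String s) {
        byte[] bytes = s.getBytes(StandardCharsets.UTF_8);
        return new String(bytes, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        String content = "ni\u00f1os"; // "niños" written with a Unicode escape

        // Matching charsets: the round trip returns the original string.
        System.out.println(roundTripUtf8(content).equals(content));

        // Mismatched charsets: the result differs and is one character longer.
        String garbled = decodeUtf8AsLatin1(content);
        System.out.println(garbled.equals(content));
        System.out.println(garbled.length());
    }
}
```

If your input shows the same pattern (one accented character turning into two), I would check the charset used when the text is read in and when it is printed, and the encoding the compiler assumes for the source file, before looking at the index itself.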
On Aug 21, 2008, at 1:16 PM, Juan Pablo Morales wrote:
I have an index in Spanish and I use Snowball to stem and analyze, and
it works perfectly. However, I am running into trouble storing (not
indexing, only storing) words that have special characters.
That is, I store the special character, but it comes back garbled when
I read it.
To provide an example:
String content = "niños";
document.add(new Field("name", content, Field.Store.YES, Field.Index.TOKENIZED));
writer.addDocument(document, new SnowballAnalyzer("Spanish"));
.
When I read the field back
String nombre = doc.get("name");
Then nombre will contain "ni�os"
Looking at the index with Luke it shows me "ni�os" but when I want to
see the full text (by right clicking) it shows me ni�os.
I know Lucene is supposed to store fields in UTF-8, but then, how can I
make sure I store something and get it back just as it was, including
special characters?
Thanks
--
Juan Pablo Morales
Ingenian Software ltda
Bogotá, Colombia
--------------------------
Grant Ingersoll
http://www.lucidimagination.com

Lucene Helpful Hints:
http://wiki.apache.org/lucene-java/BasicsOfPerformance
http://wiki.apache.org/lucene-java/LuceneFAQ