On Fri, Mar 14, 2008 at 11:14:56PM +0000, Colin Bell wrote:
> I was wondering if anyone ever came across a problem I seem to be
> having. I'm indexing in text files using some basic code written in
> C++. The text files may or may not be in UTF-8, ISO 8859-1 or possibly
> (but very rarely) even some other format - I have no way of knowing.

There are ways to detect the character set of a file, though not always
100% reliably.
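One common heuristic, sketched here as an illustration rather than anything Xapian provides: text which is well-formed UTF-8 and contains non-ASCII bytes is very unlikely to be ISO-8859-1, so a strict validity check gives a reasonable first guess. The function name looks_like_utf8 is made up for this example.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: returns true if `s` is well-formed UTF-8.
// Overlong forms, UTF-16 surrogates and code points above U+10FFFF
// are rejected. Pure ASCII input also returns true.
bool looks_like_utf8(const std::string& s) {
    const std::size_t n = s.size();
    std::size_t i = 0;
    while (i < n) {
        const unsigned char c = static_cast<unsigned char>(s[i]);
        if (c < 0x80) {  // plain ASCII byte
            ++i;
            continue;
        }
        std::size_t len;
        unsigned long cp;
        if (c >= 0xC2 && c <= 0xDF) {
            len = 2; cp = c & 0x1F;
        } else if (c >= 0xE0 && c <= 0xEF) {
            len = 3; cp = c & 0x0F;
        } else if (c >= 0xF0 && c <= 0xF4) {
            len = 4; cp = c & 0x07;
        } else {
            return false;  // stray continuation byte or invalid lead byte
        }
        if (i + len > n) return false;  // sequence truncated at end of input
        for (std::size_t j = 1; j < len; ++j) {
            const unsigned char cc = static_cast<unsigned char>(s[i + j]);
            if ((cc & 0xC0) != 0x80) return false;  // not a continuation byte
            cp = (cp << 6) | (cc & 0x3F);
        }
        // Reject overlong 3-byte forms and surrogates.
        if (len == 3 && (cp < 0x800 || (cp >= 0xD800 && cp <= 0xDFFF))) return false;
        // Reject overlong 4-byte forms and values beyond Unicode's range.
        if (len == 4 && (cp < 0x10000 || cp > 0x10FFFF)) return false;
        i += len;
    }
    return true;
}
```

A file which fails this check is probably in some 8-bit encoding, but the check can't tell you which one, and that is where the "not always 100% reliably" part comes in.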
> Question is, does Xapian convert non-UTF-8 characters when it stores
> the document? I think I read that UTF-8 is the default encoding for
> Xapian, which is exactly what I am after.

Most of Xapian treats things as opaque data. The classes which need
to know the encoding are Xapian::Stem, Xapian::QueryParser, and
Xapian::TermGenerator. The UTF-8 parsing used by the latter two will
treat invalid sequences as if they were ISO-8859-1, which for
real-world examples will almost always do the right thing when fed
ISO-8859-1. Xapian::Stem uses Snowball's UTF-8 parsing code currently -
I'm not sure how that handles invalid sequences.

> The reason I'm asking is that I am getting some seriously corrupted
> characters in the index. When they are displayed on Tomcat I get a
> "sun.io.MalformedInputException" when trying to display the search
> results. I have set the page's charset to UTF-8 and apparently this
> error is thrown when the streamreader detects characters that are
> not proper UTF-8 characters.

If you set document data, document values, or directly add terms (using
Document::add_posting() or Document::add_term()) then you'll get back
what you put in verbatim. So if you pass in something which is invalid
UTF-8, it will still be invalid.

If you pass data through Xapian::Utf8Iterator before doing anything with
it, then this will fix bad UTF-8. This is essentially what omindex
does to deal with this problem.
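To show the idea without depending on Xapian, here is a standalone sketch of that kind of clean-up: decode the input as UTF-8, treat any byte which doesn't start a valid sequence as an ISO-8859-1 character, and re-encode everything as UTF-8. This is an illustration under those assumptions, not Xapian's or omindex's actual code, and the names decode_one, put_utf8 and repair_utf8 are invented for the example.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: try to decode one well-formed UTF-8 sequence
// starting at s[i]. Returns its length in bytes, or 0 if invalid.
static std::size_t decode_one(const std::string& s, std::size_t i,
                              unsigned long& cp) {
    const unsigned char c = static_cast<unsigned char>(s[i]);
    if (c < 0x80) { cp = c; return 1; }
    std::size_t len;
    if (c >= 0xC2 && c <= 0xDF)      { len = 2; cp = c & 0x1F; }
    else if (c >= 0xE0 && c <= 0xEF) { len = 3; cp = c & 0x0F; }
    else if (c >= 0xF0 && c <= 0xF4) { len = 4; cp = c & 0x07; }
    else return 0;                    // invalid lead byte
    if (i + len > s.size()) return 0; // truncated sequence
    for (std::size_t j = 1; j < len; ++j) {
        const unsigned char cc = static_cast<unsigned char>(s[i + j]);
        if ((cc & 0xC0) != 0x80) return 0;
        cp = (cp << 6) | (cc & 0x3F);
    }
    if (len == 3 && (cp < 0x800 || (cp >= 0xD800 && cp <= 0xDFFF))) return 0;
    if (len == 4 && (cp < 0x10000 || cp > 0x10FFFF)) return 0;
    return len;
}

// Hypothetical helper: append code point `cp` to `out` as UTF-8.
static void put_utf8(std::string& out, unsigned long cp) {
    if (cp < 0x80) {
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
}

// Return a valid UTF-8 copy of `in`. Bytes which don't begin a valid
// sequence are taken to be ISO-8859-1, where the byte value is the
// Unicode code point.
std::string repair_utf8(const std::string& in) {
    std::string out;
    std::size_t i = 0;
    while (i < in.size()) {
        unsigned long cp = 0;
        std::size_t len = decode_one(in, i, cp);
        if (len == 0) {  // fall back to ISO-8859-1 for this byte
            cp = static_cast<unsigned char>(in[i]);
            len = 1;
        }
        put_utf8(out, cp);
        i += len;
    }
    return out;
}
```

Running every string through something like repair_utf8() before calling set_data(), add_value() or add_term() means that whatever comes back out of the database is valid UTF-8, even if the input file was ISO-8859-1. Input which is already valid UTF-8 passes through unchanged.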
Cheers,
Olly