Hi,

I'm new to the list, but I've been using Xapian along with the Ruby
bindings and Xapit for over 1.5 years and it's working great. But now
I've run into a very strange encoding issue.

I'm using Xapian 1.0.11 on Solaris.

This is the issue: I'm pulling ISO-8859-15 encoded data from a legacy
database and indexing it. Some of that data contains German Umlaut
characters. When I search for those words, Xapian finds nothing. That
should not surprise me, since the docs say that Xapian expects UTF-8
encoded strings. So I use Iconv to convert the strings from
ISO-8859-15 to UTF-8 before I pass them to Xapian to be indexed, but it
still doesn't work. The weird thing, however, is that when I just put
a UTF-8 string literal into my Ruby code and return it in place of the
actual string that should be indexed, it works, even with Umlauts. So
a UTF-8 string LITERAL works, but a UTF-8 string that has been
converted from ISO-8859-15 does not.
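
The conversion step described above can be checked in isolation. The following is a minimal sketch, assuming Ruby 1.9 or later, where `String#encode` does the job that Iconv did on 1.8; it is not the code from the original setup, and the helper name `latin9_to_utf8` is hypothetical. Comparing the converted string with a UTF-8 literal shows whether the conversion itself is at fault.

```ruby
# Hypothetical helper (not from the original post): transcode raw
# ISO-8859-15 bytes into a UTF-8 string using String#encode.
def latin9_to_utf8(bytes)
  bytes.dup.force_encoding(Encoding::ISO_8859_15).encode(Encoding::UTF_8)
end

legacy    = "\xC4gypten".b        # "Ägypten" as raw ISO-8859-15 bytes
converted = latin9_to_utf8(legacy)
literal   = "\u00C4gypten"        # the same word written as a UTF-8 literal

puts converted.valid_encoding?    # is the result well-formed UTF-8?
puts converted == literal         # same bytes as the literal?
puts converted.bytes.map { |b| format("%02X", b) }.join(" ")
```

If both checks print `true`, the converted string and the literal are identical, and the cause has to be somewhere after the conversion.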

Does this sound familiar to anyone? Any help would be appreciated!

- Johannes

--
springenwerk.com | github.com/jfahrenkrug | twitter.com/jfahrenkrug


  • Johannes Fahrenkrug at Jan 25, 2011 at 1:50 am
    Hi,

    After some more digging it seems to have to do with capital Umlauts.
    So when I index the UTF-8 String "ägypten", I can search for "ägypten"
    and for "Ägypten" and get results for both searches.

    But when I index the UTF-8 string "Ägypten", I don't get any results,
    whether I search for "ägypten" or for "Ägypten".

    Is that a bug or am I missing something?

    Cheers,

    Johannes

  • Adam Sjøgren at Jan 25, 2011 at 11:48 am

    On Mon, 24 Jan 2011 17:50:59 -0800, Johannes wrote:

    > After some more digging it seems to have to do with capital Umlauts.
    > So when I index the UTF-8 String "ägypten", I can search for "ägypten"
    > and for "Ägypten" and get results for both searches.
    > But when I index the UTF-8 string "Ägypten", I don't get any results,
    > whether I search for "ägypten" or for "Ägypten".
    > Is that a bug or am I missing something?

    It sounds like the Ä isn't being lowercased as it should at indexing
    time, so the Ä gets included in the "prefix" of the term rather than the
    "content" (so the index gets something like ZÄgypt instead of Zägypt,
    and you'll get a hit if you search for "gypten").

    How is Ruby's support for lowercasing of utf-8 chars?


    Just a guess,

    Adam

    --
    "Här kommer rädslan, gamle vän                         Adam Sjøgren
    När alla fjärilar i magen vaknar upp              asjo at koldfront.dk
    Viskar välkommen hem"
  • Johannes Fahrenkrug at Jan 26, 2011 at 12:49 am
    Hi Adam,
    > It sounds like the Ä isn't being lowercased as it should at indexing
    > How is Ruby's support for lowercasing of utf-8 chars?

    Ruby's UTF-8 "support" is a joke in 1.8.x. But that was exactly the
    problem. The "downcase" method of Ruby's String class didn't downcase
    UTF-8 characters. There are two ways to get around it: If you're using
    Rails, use "a string".mb_chars.downcase. Otherwise, require the
    "unicode" gem and use Unicode::downcase("a string").

    Cheers,

    Johannes
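
The difference between the two kinds of lowercasing can be reproduced on current Ruby without extra gems. This is a minimal sketch, assuming Ruby 2.4 or later, where `String#downcase` folds Unicode characters by default and accepts an `:ascii` option that restricts folding to A-Z. The ASCII-only call is used here as a stand-in for the 1.8 behaviour described above, not as a reproduction of it, and both helper names are hypothetical.

```ruby
# Hypothetical helpers for comparison; neither is part of Xapian or Xapit.

# Folds only A-Z, comparable to what String#downcase did on Ruby 1.8.
def ascii_downcase(text)
  text.downcase(:ascii)
end

# Folds every character that has a lowercase form (default since Ruby 2.4).
def unicode_downcase(text)
  text.downcase
end

word = "\u00C4gypten"             # "Ägypten"

puts ascii_downcase(word)         # the capital Umlaut survives
puts unicode_downcase(word)       # the capital Umlaut is folded

# The two results differ only in the first character.
puts ascii_downcase(word) == unicode_downcase(word)
```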
  • Olly Betts at Jan 29, 2011 at 12:17 pm

    On Tue, Jan 25, 2011 at 04:49:56PM -0800, Johannes Fahrenkrug wrote:
    > Ruby's UTF-8 "support" is a joke in 1.8.x. But that was exactly the
    > problem. The "downcase" method of Ruby's String class didn't downcase
    > UTF-8 characters. There are two ways to get around it: If you're using
    > Rails, use "a string".mb_chars.downcase. Otherwise, require the
    > "unicode" gem and use Unicode::downcase("a string").

    I'm not familiar with xapit, but at the Xapian API level you should be
    able to feed UTF-8 text to Xapian::TermGenerator for indexing, and
    UTF-8 query strings to Xapian::QueryParser for parsing, and case folding
    will be done for you for any characters which have a lowercase
    equivalent in the Unicode tables.

    So I guess you or xapit aren't using TermGenerator and QueryParser (or
    are only using one of them)? If you are, it sounds like a Xapian bug,
    but not one I can reproduce.

    Cheers,
    Olly
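
The symptom in this thread is what one would expect when case folding differs between indexing and querying. The sketch below illustrates that with a toy in-memory index. It is not Xapian and does not use the Xapian API; `build_index`, `search`, and the two fold lambdas are hypothetical names, and Ruby 2.4 or later is assumed for `downcase(:ascii)`.

```ruby
# Toy inverted index for illustration only; this is NOT how Xapian works
# internally, and none of these names come from the Xapian bindings.
ASCII_FOLD   = ->(word) { word.downcase(:ascii) }  # folds A-Z only
UNICODE_FOLD = ->(word) { word.downcase }          # also folds Ä to ä

# Maps each folded word to the ids of the documents that contain it.
def build_index(docs, fold)
  index = Hash.new { |hash, term| hash[term] = [] }
  docs.each do |id, text|
    text.split.each { |word| index[fold.call(word)] << id }
  end
  index
end

# Folds the query term and looks it up; returns [] when nothing matches.
def search(index, query, fold)
  index.fetch(fold.call(query), [])
end

docs = { 1 => "\u00C4gypten" }    # one document containing "Ägypten"

mismatched = build_index(docs, ASCII_FOLD)    # indexing keeps the capital Ä
consistent = build_index(docs, UNICODE_FOLD)  # indexing folds Ä to ä

p search(mismatched, "\u00E4gypten", UNICODE_FOLD)  # => []
p search(consistent, "\u00C4gypten", UNICODE_FOLD)  # => [1]
```

With the same folding on both sides, the capital and lowercase spellings find the document; with ASCII-only folding at indexing time, neither does.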

Discussion Overview
group: xapian-discuss
categories: xapian
posted: Jan 24, '11 at 10:07p
active: Jan 29, '11 at 12:17p
posts: 5
users: 3
website: xapian.org
irc: #xapian
