FAQ
I tried running omindex on the following file, which is a
UTF-8 web page with mixed English and Japanese text.

http://www.mail-archive.com/axis-user-ja@ws.apache.org/msg00058.html

An English query with Omega mostly worked. The only problem was
the summary results were displayed as gibberish - looked like UTF-8
data against a Latin-1 character set. I suspect this issue is easily fixed
by tacking on a UTF-8 META tag in the search interface.

More seriously, Japanese searches didn't seem to work at all. Cutting
and pasting a few words into the browser yielded no results. Additionally,
the UTF-8 query was escaped into character entity references; e.g.
a query for 皆様 got me a blank result page with the query listed as
numeric character references for 皆様 rather than the characters themselves.

Any comments? I was really surprised, since Omega did so well
in an earlier test against a similar UTF-8 document written in Danish.
Is this a matter of polish or are there deeper barriers, like a lack of
word splitting capability for languages like Chinese/Japanese/Korean?


  • James Aylett at Aug 10, 2006 at 10:43 am

    On Wed, Aug 09, 2006 at 11:43:34PM -0700, Jeff Breidenbach wrote:

    I tried running omindex on the following file, which is a
    UTF-8 web page with mixed English and Japanese text. [...]
    Any comments? I was really surprised, since Omega did so well
    in an earlier test against a similar UTF-8 document written in Danish.
    Is this a matter of polish or are there deeper barriers, like a lack of
    word splitting capability for languages like Chinese/Japanese/Korean?
    omindex (and the QueryParser) has somewhat primitive,
    European-centric, word splitting. The tricky bit is actually for the
    query parser ... you could either make it so you have to specify the
    language you're searching in, and set splitting and stemming
    appropriately (or auto-detect the language), or parse it all possible
    ways (based on which languages exist in your database) and merge the
    results somehow.
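
    As a rough illustration of the "parse it all possible ways and merge"
    idea, a minimal sketch against the current Xapian C++ API (the 0.9.x
    QueryParser discussed in this thread had a slightly different
    interface, so treat this as illustrative only):

        #include <xapian.h>
        #include <string>
        #include <vector>

        // Parse the same query text once per candidate language, then OR
        // the parses together so any interpretation can match.
        Xapian::Query parse_all_ways(const std::string &text,
                                     const std::vector<std::string> &languages)
        {
            std::vector<Xapian::Query> parses;
            for (const std::string &lang : languages) {
                Xapian::QueryParser qp;
                qp.set_stemmer(Xapian::Stem(lang));   // e.g. "english", "danish"
                qp.set_stemming_strategy(Xapian::QueryParser::STEM_SOME);
                parses.push_back(qp.parse_query(text));
            }
            return Xapian::Query(Xapian::Query::OP_OR, parses.begin(), parses.end());
        }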

    Ultimately it would be nice to support this kind of thing. The first
    step is UTF-8 support, which Olly has been working on. On top of that
    we'd need a word splitting algorithm for CJK (and anything else that we
    can't throw English-like rules at). My understanding is that there
    isn't a good stemming strategy for CJK, so we'd just disable it there.

    Lots of work to make this sort of thing work automatically. If anyone
    knows about word splitting for CJK, that'd be a huge help ...

    James

    --
    /--------------------------------------------------------------------------\
    James Aylett xapian.org
    james@tartarus.org uncertaintydivision.org
  • Reini Urban at Aug 10, 2006 at 8:25 pm

    James Aylett wrote:
    On Wed, Aug 09, 2006 at 11:43:34PM -0700, Jeff Breidenbach wrote:

    I tried running omindex on the following file, which is a
    UTF-8 web page with mixed English and Japanese text. [...]
    Any comments? I was really surprised, since Omega did so well
    in an earlier test against a similar UTF-8 document written in Danish.
    Is this a matter of polish or are there deeper barriers, like a lack of
    word splitting capability for languages like Chinese/Japanese/Korean?
    omindex (and the QueryParser) has somewhat primitive,
    European-centric, word splitting. The tricky bit is actually for the
    query parser ... you could either make it so you have to specify the
    language you're searching in, and set splitting and stemming
    appropriately (or auto-detect the language), or parse it all possible
    ways (based on which languages exist in your database) and merge the
    results somehow.

    Ultimately it would be nice to support this kind of thing. The first
    step is UTF-8 support, which Olly has been working on. On top of that
    we'd need a word splitting algorithm for CJK (and anything else that we
    can't throw English-like rules at). My understanding is that there
    isn't a good stemming strategy for CJK, so we'd just disable it there.

    Lots of work to make this sort of thing work automatically. If anyone
    knows about word splitting for CJK, that'd be a huge help ...
    And what about automatic language detection?
    That would help me also tremendously as I have about 60% english, 20%
    german, 5% french, 5% korean, 5% japanese, and 5% italian.

    Automatic charset detection would of course help also. Aren't there any
    libraries out there?
  • Jeff Breidenbach at Aug 11, 2006 at 5:42 am

    Ultimately it would be nice to support this kind of thing. The first
    step is UTF-8 support, which Olly has been working on.
    Omega is producing excellent results for French/Italian/German/Spanish
    against UTF-8 HTML files.
    Lots of work to make this sort of thing work automatically. If anyone
    knows about word splitting for CJK, that'd be a huge help ...
    Chinese is easy; one word per character. I suspect a basic unicode
    lookup table (e.g. this range is for Chinese characters) and a few simple
    rules would go a long way.

    However, if serious expertise is desired I suspect Toshiyuki Kimura
    (toshi AT apache.org) might be willing to answer questions. Or will be
    able to refer someone.
    And what about automatic language detection?
    That would help me also tremendously as I have about 60% english, 20%
    german, 5% french, 5% korean, 5% japanese, and 5% italian.
    Not sure why this is useful, except perhaps for stemming. Even then
    you will be in trouble for mixed language documents. Seems a little
    outside the scope of Xapian, at least from my newbie perspective.
    Anyway, there are a couple of n-gram based language detectors in open
    source land which work fairly well, but the error rate is noticeable.
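
    As a rough illustration of how those n-gram detectors work (byte
    trigrams here for brevity; real detectors build per-language profiles
    from training text and use a more careful rank-order distance):

        #include <map>
        #include <string>

        // Count overlapping 3-byte sequences in a text sample.
        std::map<std::string, int> trigram_profile(const std::string &text) {
            std::map<std::string, int> counts;
            for (size_t i = 0; i + 3 <= text.size(); ++i)
                ++counts[text.substr(i, 3)];
            return counts;
        }

        // Crude similarity: sum of products of counts for shared trigrams.
        // The language whose training profile scores highest wins.
        long similarity(const std::map<std::string, int> &doc,
                        const std::map<std::string, int> &lang) {
            long s = 0;
            for (const auto &p : doc) {
                auto it = lang.find(p.first);
                if (it != lang.end()) s += long(p.second) * it->second;
            }
            return s;
        }
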
    Automatic charset detection would of course help also. Aren't there any
    libraries out there?
    Not that I know about, except for more probabilistic n-gram stuff
    that's even flakier than language detection. I thought UTF-8 solved
    this problem. Documents not using unicode? The horror!!
  • James Aylett at Aug 18, 2006 at 12:27 pm

    On Thu, Aug 10, 2006 at 09:42:01PM -0700, Jeff Breidenbach wrote:

    Ultimately it would be nice to support this kind of thing. The first
    step is UTF-8 support, which Olly has been working on.
    Omega is producing excellent results for French/Italian/German/Spanish
    against UTF-8 HTML files.
    Is that with the patch? Unless I've missed something, I don't think we
    have released support yet.
    Lots of work to make this sort of thing work automatically. If anyone
    knows about word splitting for CJK, that'd be a huge help ...
    Chinese is easy; one word per character. I suspect a basic unicode
    lookup table (e.g. this range is for Chinese characters) and a few simple
    rules would go a long way.
    Erm ... lookup tables aren't good for unicode are they? I'd have
    thought a tiny predicate function that just checks the codepoint
    against known Chinese ranges would work better.
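
    For example, such a predicate might look roughly like this (ranges
    taken from the Unicode standard; deliberately not exhaustive, and
    supplementary-plane extensions are omitted):

        // True if the codepoint is a CJK ideograph (BMP ranges only).
        bool is_cjk_ideograph(unsigned codepoint) {
            return (codepoint >= 0x4E00 && codepoint <= 0x9FFF)   // CJK Unified Ideographs
                || (codepoint >= 0x3400 && codepoint <= 0x4DBF)   // Extension A
                || (codepoint >= 0xF900 && codepoint <= 0xFAFF);  // Compatibility Ideographs
        }
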
    And what about automatic language detection?
    That would help me also tremendously as I have about 60% english, 20%
    german, 5% french, 5% korean, 5% japanese, and 5% italian.
    Not sure why this is useful, except perhaps for stemming. Even then
    you will be in trouble for mixed language documents. Seems a little
    outside the scope of Xapian, at least from my newbie perspective.
    Anyway, there are a couple of n-gram based language detectors in open
    source land which work fairly well, but the error rate is noticeable.
    The 'right' way is to mark up the language use in the
    document. However language detectors as a fallback would be neat. In
    general the way of approaching this that I'd favour would be to
    generate scriptindex input files, so your indexing setup is the only
    thing that needs to care about whether you're using detection or
    reading it out of xml:lang attributes or whatever.
    Automatic charset detection would of course help also. Aren't there any
    libraries out there?
    Not that I know about, except for more probabilistic n-gram stuff
    that's even flakier than language detection. I thought UTF-8 solved
    this problem. Documents not using unicode? The horror!!
    <grins>

    Firefox's auto-detection of charsets is regularly fooled. If you
    assume UTF-8, iso-8859-1, then some multibyte options, you might be
    able to do it, but it's not a good idea. Again, marking up the
    document explicitly is always going to be better.
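
    The usual first step in that kind of fallback is simply checking
    whether the bytes are structurally valid UTF-8 at all, something like
    the sketch below (crude: it ignores overlong encodings and other fine
    points):

        #include <string>

        // Returns true if every byte sequence in s is structurally valid UTF-8.
        bool looks_like_utf8(const std::string &s) {
            size_t i = 0;
            while (i < s.size()) {
                unsigned char c = s[i];
                size_t extra;
                if (c < 0x80) extra = 0;
                else if ((c & 0xE0) == 0xC0) extra = 1;
                else if ((c & 0xF0) == 0xE0) extra = 2;
                else if ((c & 0xF8) == 0xF0) extra = 3;
                else return false;                        // invalid lead byte
                if (i + extra >= s.size()) return false;  // truncated sequence
                for (size_t j = 1; j <= extra; ++j) {
                    unsigned char cont = s[i + j];
                    if ((cont & 0xC0) != 0x80) return false;  // bad continuation byte
                }
                i += extra + 1;
            }
            return true;
        }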

    James

    --
    /--------------------------------------------------------------------------\
    James Aylett xapian.org
    james@tartarus.org uncertaintydivision.org
  • Jeff Breidenbach at Aug 19, 2006 at 4:13 am

    Omega is producing excellent results for French/Italian/German/Spanish
    against UTF-8 HTML files.
    Is that with the patch? Unless I've missed something, I don't think we
    have released support yet.
    I tried 0.9.6 unpatched, and also with xapian-qp-utf8-0.9.5.patch
    applied. Either way, UTF-8 Danish worked fine, UTF-8 Japanese was
    a disaster. Although Japanese got a little better once I put a
    META tag in the search form to tell the browser to think in UTF-8.
    Previously Firefox was ... converting the Japanese query string into
    HTML numerical entity references at form submission time!

    http://www.mail-archive.com/cgi-bin/omega/omega?P=alts%C3%A5&DB=brygforum%40lists.haandbryg.dk
    http://www.mail-archive.com/cgi-bin/omega/omega?P=%E6%A7%98&DB=axis-user-ja%40ws.apache.org
    Erm ... lookup tables aren't good for unicode are they? I'd have
    thought a tiny predicate function that just checks the codepoint
    against known Chinese ranges would work better.
    You are right, I recently read more about unicode and became very,
    very, very scared. The 'NJ' codepoint for Croatian is completely insane!
    I'm pretty much terrified to do anything other than library calls. And I don't
    see a word break library call in glib/gunicode.h
  • James Aylett at Aug 19, 2006 at 12:46 pm

    On Fri, Aug 18, 2006 at 08:13:50PM -0700, Jeff Breidenbach wrote:

    I tried 0.9.6 unpatched, and also with xapian-qp-utf8-0.9.5.patch
    applied. Either way, UTF-8 Danish worked fine, UTF-8 Japanese was
    a disaster.
    I suspect that's largely luck, then.
    Although Japanese got a little better once I put a META tag in
    the search form to tell the browser to think in UTF-8. Previously
    Firefox was ... converting the Japanese query string into HTML
    numerical entity references at form submission time!
    Yeah, as far as I'm aware there's no standard on what you should do if
    your form enctype (which tends to default to the document charset,
    which is daft but there you go) can't cope with characters you're
    submitting. Note that multipart/form-data copes with this properly,
    because it allows different form fields to have different encodings.
    Erm ... lookup tables aren't good for unicode are they? I'd have
    thought a tiny predicate function that just checks the codepoint
    against known Chinese ranges would work better.
    You are right, I recently read more about unicode and became very,
    very, very scared. The 'NJ' codepoint for Croatian is completely
    insane! I'm pretty much terrified to do anything other than library
    calls. And I don't see a word break library call in glib/gunicode.h
    Unicode is big and complex, but that's because it's trying to do an
    insanely difficult job. I have Unicode 3.0 at work, and I have
    actually read pretty much all the rules and bits and pieces. Of
    course, now it's got that little bit more complex... :)

    Tim Bray recommended some good resources for starting to grok unicode
    a while back; they should be findable on his blog.

    James

    --
    /--------------------------------------------------------------------------\
    James Aylett xapian.org
    james@tartarus.org uncertaintydivision.org
  • Olly Betts at Aug 26, 2006 at 4:09 pm

    On Sat, Aug 19, 2006 at 12:46:54PM +0100, James Aylett wrote:
    Yeah, as far as I'm aware there's no standard on what you should do if
    your form enctype (which tends to default to the document charset,
    which is daft but there you go) can't cope with characters you're
    submitting. Note that multipart/form-data copes with this properly,
    because it allows different form fields to have different encodings.
    The problem is that search forms usually want to use METHOD=GET so
    that users can bookmark the results page.

    The best approach I've found is to simply ensure that the document
    containing the search form is in an encoding which can handle all
    unicode characters. UTF-8 is the best choice since at least unaccented
    latin characters appear in human readable form in the query URL.

    Cheers,
    Olly
  • James Aylett at Aug 27, 2006 at 3:18 pm

    On Sat, Aug 26, 2006 at 04:09:52PM +0100, Olly Betts wrote:

    Yeah, as far as I'm aware there's no standard on what you should do if
    your form enctype (which tends to default to the document charset,
    which is daft but there you go) can't cope with characters you're
    submitting. Note that multipart/form-data copes with this properly,
    because it allows different form fields to have different encodings.
    The problem is that search forms usually want to use METHOD=GET so
    that users can bookmark the results page.
    Indeed :-/
    The best approach I've found is to simply ensure that the document
    containing the search form is in an encoding which can handle all
    unicode characters. UTF-8 is the best choice since at least unaccented
    latin characters appear in human readable form in the query URL.
    UTF-8 is generally the right encoding to use for most general purpose
    applications these days. It has its problems, not least the political
    ones (largely inherited from Unicode), but there is good support
    around and it deals with a lot more of the problems than anything else
    I'm aware of.

    If you are charset=utf-8 (which you certainly should be for XHTML, and
    is a very good idea for HTML 4), then your HTML forms should transmit
    all Unicode code points through successfully.

    Whether you can conveniently work with them in the backend is another
    matter entirely (although Python, Java, C# and C++ all make this
    fairly easy, and it's possible but sometimes awkward in PHP). In Ruby
    it's also possible: my understanding is that there are some good
    libraries, and detailed Unicode support is now being designed in.

    James

    --
    /--------------------------------------------------------------------------\
    James Aylett xapian.org
    james@tartarus.org uncertaintydivision.org
  • Jeff Breidenbach at Aug 19, 2006 at 6:30 pm

    Unless I've missed something, I don't think we have released
    [UTF-8] support yet.
    Is there a plan or a roadmap or a guesstimate?
  • James Aylett at Aug 19, 2006 at 6:34 pm

    On Sat, Aug 19, 2006 at 10:30:45AM -0700, Jeff Breidenbach wrote:

    Unless I've missed something, I don't think we have released
    [UTF-8] support yet.
    Is there a plan or a roadmap or a guesstimate?
    Not really. I can't remember the detail, but since we'll need new
    Snowball stemmers, it's probably a 1.0 thing (although bits of the
    support can land before then). Olly's the only person who can really
    answer this kind of question, and he's on holiday at the moment.

    James

    --
    /--------------------------------------------------------------------------\
    James Aylett xapian.org
    james@tartarus.org uncertaintydivision.org
  • Olly Betts at Aug 26, 2006 at 3:54 pm

    On Thu, Aug 10, 2006 at 09:42:01PM -0700, Jeff Breidenbach wrote:
    Ultimately it would be nice to support this kind of thing. The first
    step is UTF-8 support, which Olly has been working on.
    Omega is producing excellent results for French/Italian/German/Spanish
    against UTF-8 HTML files.
    I think the problem you're seeing with Japanese is with omindex's term
    generation. The omega CGI should work fine, but I didn't patch omindex
    yet as I have a separate indexer for gmane which knows about UTF-8.

    The plan is to have everything working with UTF-8 for Xapian 1.0, so
    this is actually going to be my main focus once I've dealt with my
    email backlog.
    Lots of work to make this sort of thing work automatically. If anyone
    knows about word splitting for CJK, that'd be a huge help ...
    Chinese is easy; one word per character. I suspect a basic unicode
    lookup table (e.g. this range is for Chinese characters) and a few simple
    rules would go a long way.
    Chinese isn't really as simple as one word per character. Chinese
    characters are themselves words, but many words are formed from multiple
    characters. For example, the Chinese capital Beijing is formed from two
    characters (which literally mean something like "North Capital").
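
    One dictionary-free compromise, not suggested in this thread but a
    common way to cope with exactly this, is to index overlapping
    character bigrams as well as single characters, so a two-character
    word like Beijing still matches as a unit; a sketch:

        #include <string>
        #include <vector>

        // Split a run of CJK characters into unigram and overlapping bigram
        // terms (input assumed to be already decoded into 32-bit codepoints).
        std::vector<std::u32string> cjk_ngrams(const std::u32string &run) {
            std::vector<std::u32string> terms;
            for (size_t i = 0; i < run.size(); ++i) {
                terms.push_back(run.substr(i, 1));       // single character
                if (i + 1 < run.size())
                    terms.push_back(run.substr(i, 2));   // overlapping pair
            }
            return terms;
        }
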
    And what about automatic language detection?
    Not sure why this is useful, except perhaps for stemming. Even then
    you will be in trouble for mixed language documents.
    I think it only really matters for stemming, and also if you want to
    allow users to filter a query to just show results in a particular
    language.
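
    Filtering by language is cheap at search time if each document is
    indexed with a language term; assuming an "L" prefix plus an ISO code
    (the prefix choice here is just an example), it becomes a boolean
    filter:

        #include <xapian.h>
        #include <string>

        // Restrict an existing query to documents tagged with one language.
        Xapian::Query filter_by_language(const Xapian::Query &q,
                                         const std::string &iso_code) {
            return Xapian::Query(Xapian::Query::OP_FILTER, q,
                                 Xapian::Query("L" + iso_code));
        }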

    While different languages may have different ideas of word splitting, I
    think in reality there are few, if any, characters which are a word
    character in one language but a word breaking character in another.

    For gmane, I just don't use stemming currently. As you say, the error
    rate of language identifiers is noticeable, and they perform worse
    on very short texts, so you really can't use them to detect what
    language a query is in.

    Cheers,
    Olly
  • Fabrice Colin at Aug 11, 2006 at 12:23 pm

    On 8/11/06, "Jeff Breidenbach" wrote:
    And what about automatic language detection?
    That would help me also tremendously as I have about 60% english, 20%
    german, 5% french, 5% korean, 5% japanese, and 5% italian.
    Not sure why this is useful, except perhaps for stemming. Even then
    you will be in trouble for mixed language documents. Seems a little
    outside the scope of Xapian, at least from my newbie perspective.
    Anyway, there are a couple of n-gram based language detectors in open
    source land which work fairly well, but the error rate is noticeable.
    I am using libtextcat (http://software.wise-guys.nl/libtextcat/) for Pinot.
    It's pretty accurate, at least with the few European languages I tried.
    Korean and Japanese are supported too apparently...

    Fabrice
  • Jeff Breidenbach at Aug 13, 2006 at 5:34 am
    This is looking promising. Running down my Omega checklist:

    * The patch is still too crude to submit, but I've beaten htmlparse.cc
    into respecting <!--htdig_noindex--><!--/htdig_noindex-->

    * I've located the 300 character limit on sample size in omindex.cc,
    but am leaving that alone for the time being. Will keep in mind for
    improving summary results later. [1]

    * Getting filesize and last modification date in summary results is
    nice to have, but not critical. Putting on backburner.

    * I'm now building some flint indices for testing. This will probably
    take about a week to complete. When finished, this may provide
    some interesting benchmarks.

    * How can I best help with CJK ? The more concrete the suggestion,
    the better.

    Cheers,
    Jeff

    [1] http://lists.tartarus.org/pipermail/xapian-discuss/2006-January/001471.html
  • Reini Urban at Aug 13, 2006 at 8:50 am

    2006/8/13, Jeff Breidenbach <breidenbach@gmail.com>:
    This is looking promising. Running down my Omega checklist:

    * The patch is still too crude to submit, but I've beaten htmlparse.cc
    into respecting <!--htdig_noindex--><!--/htdig_noindex-->

    * I've located the 300 character limit on sample size in omindex.cc,
    but am leaving that alone for the time being. Will keep in mind for
    improving summary results later. [1]

    * Getting filesize and last modification date in summary results is
    nice to have, but not critical. Putting on backburner.

    * I'm now building some flint indices for testing. This will probably
    take about a week to complete. When finished, this may provide
    some interesting benchmarks.
    For first tests and benchmarks it's much better to use smaller databases.
    I run a make check in a small test subdir, with all major document types
    only, for both flint and quartz.

    And the bench runs I do with 30,000 docs, which need ~20 min on Cygwin
    without the last_mod check, and about 2 min with the last_mod check and
    cached extracted "virtual dirs" (zip, msg, ...).
    Cygwin, pdftotext and xls2csv are slow.
    --
    Reini Urban
    Racing Simu and Support
    AVL List GesmbH Graz
  • James Aylett at Aug 19, 2006 at 9:58 pm

    On Sat, Aug 12, 2006 at 09:34:50PM -0700, Jeff Breidenbach wrote:

    * Getting filesize and last modification date in summary results is
    nice to have, but not critical. Putting on backburner.
    I'd certainly favour having them in. Last mod is straightforward (did
    I suggest code for it? can't remember); filesize is a little more
    awkward, because at the moment the stat() result isn't passed down to
    index_file(), just the last mod time. A fairly easy enhancement,
    though.
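
    The sort of change meant here, very roughly (the helper and the value
    slots are illustrative and use the current Xapian API, not omindex's
    actual code):

        #include <sys/stat.h>
        #include <xapian.h>
        #include <string>

        // Record file size and last-modified time on a document so the
        // query template can display them in results.
        void record_file_metadata(Xapian::Document &doc, const std::string &path) {
            struct stat st;
            if (stat(path.c_str(), &st) == 0) {
                doc.add_value(1, Xapian::sortable_serialise(double(st.st_size)));
                doc.add_value(2, Xapian::sortable_serialise(double(st.st_mtime)));
            }
        }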

    James

    --
    /--------------------------------------------------------------------------\
    James Aylett xapian.org
    james@tartarus.org uncertaintydivision.org
  • Olly Betts at Aug 27, 2006 at 1:29 am

    On Sat, Aug 12, 2006 at 09:34:50PM -0700, Jeff Breidenbach wrote:
    * The patch is still too crude to submit, but I've beaten htmlparse.cc
    into respecting <!--htdig_noindex--><!--/htdig_noindex-->
    Oops, I've already committed a patch for this.
    * Getting filesize and last modification date in summary results is
    nice to have, but not critical. Putting on backburner.
    These were trivial to add, so I've just done them.
    * How can I best help with CJK ? The more concrete the suggestion,
    the better.
    One useful job which doesn't require particular knowledge of Xapian is
    to check all the filtering tools which omindex can use and discover the
    runes required to get them to produce UTF-8 output (or failing that,
    UTF-16 or UTF-32 but I suspect Unix tools are more likely to produce
    UTF-8 if they do unicode at all).

    If any can't, seek out alternative tools which can and check if they do
    as good a job of dumping text for indexing. Failing that, we can still
    support formats where the convertors only support iso-8859-1 by just
    converting the output (perhaps some formats don't support unicode
    anyway).
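
    Converting iso-8859-1 filter output to UTF-8 is the easy direction,
    since every Latin-1 byte maps to the Unicode codepoint with the same
    value; a sketch of such a conversion:

        #include <string>

        // Re-encode iso-8859-1 bytes as UTF-8.
        std::string latin1_to_utf8(const std::string &in) {
            std::string out;
            out.reserve(in.size());
            for (unsigned char c : in) {
                if (c < 0x80) {
                    out += char(c);
                } else {
                    out += char(0xC0 | (c >> 6));     // lead byte (0xC2 or 0xC3)
                    out += char(0x80 | (c & 0x3F));   // continuation byte
                }
            }
            return out;
        }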

    If any other tasks come to mind, I'll let the list know.

    Cheers,
    Olly
  • Olly Betts at Sep 6, 2006 at 1:44 pm

    On Sun, Aug 27, 2006 at 01:29:01AM +0100, Olly Betts wrote:
    On Sat, Aug 12, 2006 at 09:34:50PM -0700, Jeff Breidenbach wrote:
    * How can I best help with CJK ? The more concrete the suggestion,
    the better.
    One useful job which doesn't require particular knowledge of Xapian is
    to check all the filtering tools which omindex can use and discover the
    runes required to get them to produce UTF-8 output (or failing that,
    UTF-16 or UTF-32 but I suspect Unix tools are more likely to produce
    UTF-8 if they do unicode at all).
    I've now pretty much done this. The worst gap is that there doesn't
    seem to be a PostScript to text convertor which handles anything above
    iso-8859-1.

    The current state of my reworked code is that everything is being
    converted to UTF-8 in omindex. That mostly leaves adjusting the
    word tokenisation in line with the UTF-8 QueryParser patch, and
    deciding what to do about character sets in scriptindex.

    Further work is also still needed to handle wide character HTML files
    (such as UTF-16), or indeed HTML in any encoding which doesn't have
    ASCII as a subset.

    Anyway, I'll create a "unicode" branch in SVN soon so people can try out
    the new code.

    Cheers,
    Olly
  • Jeff Breidenbach at Sep 7, 2006 at 3:55 am

    I've now pretty much done this. The worst gap is that there doesn't
    seem to be a PostScript to text convertor which handles anything above
    iso-8859-1.
    Hmm... maybe go from PostScript to PDF, then extract the text from
    there. I don't think there are a lot of viable alternatives to Ghostscript.
    Sorry I didn't dive in to help fast enough.
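
    Something like the following pipeline, shelled out from an indexer
    (illustrative only, with no error handling or shell escaping; ps2pdf
    ships with Ghostscript, and pdftotext accepts -enc UTF-8):

        #include <cstdlib>
        #include <string>

        // Convert PostScript to UTF-8 text via an intermediate PDF.
        bool ps_to_utf8_text(const std::string &ps_file, const std::string &txt_file) {
            std::string pdf_file = txt_file + ".tmp.pdf";
            std::string cmd = "ps2pdf '" + ps_file + "' '" + pdf_file + "' && "
                              "pdftotext -enc UTF-8 '" + pdf_file + "' '" + txt_file + "'";
            return std::system(cmd.c_str()) == 0;
        }
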
    Anyway, I'll create a "unicode" branch in SVN soon so people can try out
    the new code.
    Cool! I bet there will be a number of tiny things to clean up, like cutting
    off summary indexes at N utf-8 characters instead of at N bytes. Nothing
    that won't shake out pretty quickly.
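
    That kind of cut-off might look roughly like this (truncate after at
    most max_chars characters, never splitting a multi-byte sequence):

        #include <string>

        // Truncate a UTF-8 string after at most max_chars characters,
        // keeping only whole byte sequences.
        std::string truncate_utf8(const std::string &s, size_t max_chars) {
            size_t bytes = 0, chars = 0;
            while (bytes < s.size() && chars < max_chars) {
                unsigned char c = s[bytes];
                size_t len = (c < 0x80) ? 1
                           : (c & 0xE0) == 0xC0 ? 2
                           : (c & 0xF0) == 0xE0 ? 3
                           : (c & 0xF8) == 0xF0 ? 4 : 1;
                if (bytes + len > s.size()) break;   // incomplete sequence at the end
                bytes += len;
                ++chars;
            }
            return s.substr(0, bytes);
        }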

    Jeff
