Hi,

I'm trying to adapt the gmane indexer and variant of omega (thankfully
made available by GMANE) to search through a set of ~3M messages in all
sorts of languages. A couple of questions I couldn't readily find answers to.

Does a WritableDatabase with several DBs added have defined/stable
semantics as to where documents are stored upon replace_document?
I would want to be able to re-index parts of the archive and to replace
messages.
Or is it preferable to have some sort of external partition of what is
in which DB?

Also, is there a way (short of patching, and via configuration rather than
by passing parameters) to make omega search through multiple databases?

Finally, is there a simple, good way of searching a database with
documents stemmed in different languages? The two naive ideas I could
come up with are to split the index into databases by language, or to
search with something like
OR_{lang in languages} (query_stemmed_for_lang AND LANG=lang)...

Kind regards

Thomas
--
Thomas Viehmann, http://thomas.viehmann.net/

  • Olly Betts at Nov 28, 2007 at 3:10 pm

    On Mon, Nov 26, 2007 at 09:43:08PM +0100, Thomas Viehmann wrote:
    > Does a WritableDatabase with several DBs added have defined/stable
    > semantics as to where documents are stored upon replace_document?

    WritableDatabase only supports a single sub-database at present.
    Nothing actually prevents you calling add_database() to add more, but
    the results are entirely undefined.

    It would be nice to make this work in a sensible way though. Then
    you could split indexing load with a WritableDatabase which round-robins
    updates across several remote servers.

    > I would want to be able to re-index parts of the archive and to replace
    > messages.
    > Or is it preferable to have some sort of external partition of what is
    > in which DB?

    Currently that is what you need to do. If you commonly want to search
    over subsets of the data, this approach probably is better anyway.
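For illustration, one way to keep such an external partition is to derive the target database from the message itself, so that a re-indexed message always goes back to the database that already holds it. This is only a minimal sketch under assumptions: the shard names and the helpers `shard_for` and `unique_term` are hypothetical, and the "Q" prefix is the usual Omega convention for unique-ID terms rather than anything stated in this thread.

```python
import hashlib

# Hypothetical shard names; the list must keep the same order and length,
# otherwise existing messages would map to a different database.
SHARDS = ["archive-0", "archive-1", "archive-2", "archive-3"]


def shard_for(message_id, shards=SHARDS):
    """Return the same shard for the same Message-ID on every run."""
    digest = hashlib.sha1(message_id.encode("utf-8")).digest()
    return shards[int.from_bytes(digest[:8], "big") % len(shards)]


def unique_term(message_id):
    """Build the unique-ID term to pass to replace_document() (assumed 'Q' prefix)."""
    return "Q" + message_id
```

With this, re-indexing a message means opening the WritableDatabase named by `shard_for(message_id)` and replacing the document by its unique term there. If you commonly search subsets, partitioning by list or group name instead of a hash may fit better.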

    > Also, is there a way (short of patching, and via configuration rather than
    > by passing parameters) to make omega search through multiple databases?

    It's only currently supported by passing multiple DB parameters, or a
    single DB parameter with a list of database names separated by "/".
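As a rough sketch of the second form, the snippet below builds an omega URL with a single DB parameter listing several databases separated by "/". The base URL and database names are made up for the example, and `omega_url` is a hypothetical helper; `P` is assumed to be the parameter carrying the query text.

```python
from urllib.parse import urlencode


def omega_url(base, query, databases):
    """Build an omega CGI URL that searches several databases at once."""
    params = {
        "DB": "/".join(databases),  # one DB parameter, names joined by "/"
        "P": query,                 # query text (assumed parameter name)
    }
    # Keep "/" unescaped so the database list stays readable in the URL.
    return base + "?" + urlencode(params, safe="/")


print(omega_url("http://localhost/cgi-bin/omega",
                "message threading",
                ["lists-en", "lists-de"]))
# prints http://localhost/cgi-bin/omega?DB=lists-en/lists-de&P=message+threading
```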

    > Finally, is there a simple, good way of searching a database with
    > documents stemmed in different languages? The two naive ideas I could
    > come up with are to split the index into databases by language, or to
    > search with something like
    > OR_{lang in languages} (query_stemmed_for_lang AND LANG=lang)...

    This sort of multi-language search is a problem I've seen come up a
    number of times over the years I've been involved in search, and I've
    yet to see a totally satisfactory solution.

    You can determine the language of a document pretty reliably (e.g. look
    at the textcat library), but a query string is often too short to make
    a reliable determination. Some queries are ambiguous as they make sense
    in multiple languages.

    If you can, I think it's best to sidestep these problems and set up your
    UI so that the user actually specifies (explicitly or implicitly) what
    language their query is in. Then search a database of documents in just
    that language (since you can identify these reliably enough).
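To make the "user specifies the language" approach concrete, here is a small sketch that maps a language chosen in the UI (explicitly, or implicitly via something like a browser locale) to a per-language database and stemmer name. The directory layout, the `LANGUAGES` table and `plan_search` are assumptions for illustration only; the stemmer names are ones Xapian's Stem class accepts.

```python
# Hypothetical layout: one database per language, keyed by ISO 639-1 code.
LANGUAGES = {
    "en": ("db/en", "english"),
    "de": ("db/de", "german"),
    "fr": ("db/fr", "french"),
}


def plan_search(lang, fallback="en"):
    """Return (database path, stemmer name) for a user-supplied language tag."""
    # Accept tags such as "de", "de-AT" or "de_DE" and keep only the language part.
    code = lang.strip().lower().replace("_", "-").split("-")[0]
    if code not in LANGUAGES:
        code = fallback
    return LANGUAGES[code]
```

The caller would then open only the returned database and parse the query with the returned stemmer, so that query terms are stemmed the same way as the documents they are matched against.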

    Cheers,
    Olly

Discussion Overview
group: xapian-discuss
categories: xapian
posted: Nov 26, '07 at 8:43p
active: Nov 28, '07 at 3:10p
posts: 2
users: 2
website: xapian.org
irc: #xapian