
[Xapian-discuss] STEM_SOME and prefixes.. (even boolean)

Jesper Krogh
May 15, 2008 at 11:50 am
Hi.

This seems a bit strange, and I can't really tell whether it is a bug
or a "feature", but:

I have acc listed as a boolean prefix. I use STEM_SOME since that seems
to be the most useful way of doing stuff. But it would be really nice if
we either stemmed all prefixes or didn't stem any.

I have some terms like Q1W2E3 that are listed with boolean prefixes. These
are essentially IDs, so I really don't want the stemming algorithm to
accidentally stumble over them. But if the ID happens to start with an
upper-case letter, it gets fed to the search like this:
Search:
acc:Q1W2E3

Running query 'Xapian::Query(0 * ACC:Q1W2E3)'

As far as I can tell the query with a : will never match anything in the
index?

Xapian 1.0.5

Jesper

--
Jesper Krogh

5 responses

  • Matthew Somerville at May 15, 2008 at 4:35 pm

    Jesper Krogh wrote:
    > I have acc listed as a boolean prefix.

    Do you mean you have something like:
    $queryparser->add_boolean_prefix('acc', 'Q');
    or something else?

    > I have some terms like Q1W2E3 that are listed with boolean prefixes.

    Do you mean you have a document in your database that has Q1W2E3 as a term?
    I'm guessing not because of what you say below, so what is the term you have
    entered in the database for the ID "Q1W2E3"?

    > These are essentially IDs, so I really don't want the stemming algorithm
    > to accidentally stumble over them. But if the ID happens to start with an
    > upper-case letter, it gets fed to the search like this:
    > Search:
    > acc:Q1W2E3
    >
    > Running query 'Xapian::Query(0 * ACC:Q1W2E3)'

    This doesn't sound like a stemming issue (though I could be wrong :) ). If I
    have "acc" as a boolean prefix here with the above queryparser line, a query
    for acc:Q1W2E3 to QueryParser becomes:
    Xapian::Query(0 * QQ1W2E3)
    and if I don't have "acc" as a boolean prefix, it becomes:
    Xapian::Query((acc:(pos=1) PHRASE 2 q1w2e3:(pos=2)))
    i.e. it's treated as a phrase search.

    Do you have some short example code that exhibits the issue?

    ATB,
    Matthew
  • Jesper Krogh at May 15, 2008 at 4:47 pm

    Matthew Somerville wrote:
    > Jesper Krogh wrote:
    >> I have acc listed as a boolean prefix.
    > Do you mean you have something like:
    > $queryparser->add_boolean_prefix('acc', 'Q');
    > or something else?
    >> I have some terms like Q1W2E3 that are listed with boolean prefixes.
    > Do you mean you have a document in your database that has Q1W2E3 as a
    > term? I'm guessing not because of what you say below, so what is the
    > term you have entered in the database for the ID "Q1W2E3"?
    >> These are essentially IDs, so I really don't want the stemming algorithm
    >> to accidentally stumble over them. But if the ID happens to start with
    >> an upper-case letter, it gets fed to the search like this:
    >> Search:
    >> acc:Q1W2E3
    >>
    >> Running query 'Xapian::Query(0 * ACC:Q1W2E3)'
    > This doesn't sound like a stemming issue (though I could be wrong :) ).
    > If I have "acc" as a boolean prefix here with the above queryparser
    > line, a query for acc:Q1W2E3 to QueryParser becomes:
    > Xapian::Query(0 * QQ1W2E3)
    > and if I don't have "acc" as a boolean prefix, it becomes:
    > Xapian::Query((acc:(pos=1) PHRASE 2 q1w2e3:(pos=2)))
    > i.e. it's treated as a phrase search.

    I get both behaviors if I put in q1w2e3 instead of Q1W2E3; that was
    why I thought it was related to the stemming.

    > Do you have some short example code that exhibits the issue?

    Have you set the stemming strategy to STEM_SOME?

    Jesper

    --
    Jesper
  • Matthew Somerville at May 16, 2008 at 10:21 am

    Jesper Krogh wrote:
    > Have you set the stemming strategy to STEM_SOME?

    Yes; but I only have single character boolean prefixes as in the example
    code I gave. From Olly's mail, it sounds like you are doing something like:
    $queryparser->add_boolean_prefix('acc', 'ACC');
    ?

    And then QueryParser doesn't know how to spot where the prefix ends and the
    term begins. So you can either switch to a single letter prefix in both
    indexer and query (that's what we have), or change your indexer to add the
    ACC terms to the database with the colon.

    I guess something about multi-letter prefixes should go on the wiki/docs,
    but I'm not sure where would be the best place.

    ATB,
    Matthew
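
The mismatch and the two fixes Matthew describes can be sketched without Xapian at all. The snippet below is only an illustration in plain Python: `query_term`, `matches`, and the three toy "indexes" (sets of strings) are hypothetical names, and the colon rule is an assumption taken from the description in this thread, not from Xapian's source.

```python
def query_term(prefix, value):
    """Build the term for prefix + value the way the thread describes.

    Assumption (from the thread, not from Xapian's source): a ':' is
    inserted only when the prefix has more than one character and the
    value starts with a capital letter.
    """
    if len(prefix) > 1 and value[:1].isupper():
        return prefix + ":" + value
    return prefix + value


def matches(index_terms, prefix, value):
    """True if the generated query term is present in the toy index."""
    return query_term(prefix, value) in index_terms


# The reported problem: the indexer stored ACC + ID with no colon.
broken_index = {"ACCQ1W2E3"}
# Option 1: a single-letter prefix in both indexer and query.
single_letter_index = {"QQ1W2E3"}
# Option 2: keep ACC, but index the term with the colon.
colon_index = {"ACC:Q1W2E3"}

print(matches(broken_index, "ACC", "Q1W2E3"))       # False
print(matches(single_letter_index, "Q", "Q1W2E3"))  # True
print(matches(colon_index, "ACC", "Q1W2E3"))        # True
```

Either fix works as long as the indexer and the query side build the term with the same rule.
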
  • Olly Betts at May 16, 2008 at 10:57 am

    On Fri, May 16, 2008 at 11:21:15AM +0100, Matthew Somerville wrote:
    > I guess something about multi-letter prefixes should go on the wiki/docs,
    > but I'm not sure where would be the best place.

    It is actually documented already, though possibly only in the Omega
    documentation:

    http://xapian.org/docs/omega/termprefixes.html

    Cheers,
    Olly
  • Olly Betts at May 15, 2008 at 6:08 pm

    On Thu, May 15, 2008 at 01:50:16PM +0200, Jesper Krogh wrote:
    > Search:
    > acc:Q1W2E3
    >
    > Running query 'Xapian::Query(0 * ACC:Q1W2E3)'
    >
    > As far as I can tell the query with a : will never match anything in the
    > index?

    The issue here is that given the term ACCQ1W2E3, how do you say what the
    prefix is? You're wanting it to be ACC, but it could be ACCQ, AC, or
    just A.

    So when adding a multi-character term prefix, we insert a ':' if the
    term starts with a capital so that the prefix/term boundary isn't lost.
    Obviously this needs to happen at index time too, or as you say the term
    with the colon will never match.

    There's also an assumption in some places that you follow the convention
    that multi-character prefixes only start with 'X' (I think only in Omega
    but I'm not certain).

    Cheers,
    Olly
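
To make the ambiguity Olly describes concrete, here is a small plain-Python sketch. It does not use Xapian; `candidate_prefixes` and `index_term` are hypothetical helpers, and the colon rule is an assumption based only on the wording of his explanation.

```python
def candidate_prefixes(term):
    """Every leading run of capitals that could be read as the prefix."""
    n = 0
    while n < len(term) and term[n].isupper():
        n += 1
    return [term[:i] for i in range(1, n + 1)]


def index_term(prefix, value):
    """Join prefix and value, keeping the boundary visible.

    Assumption (from the thread): insert ':' when the prefix is longer
    than one character and the value starts with a capital letter.
    """
    if len(prefix) > 1 and value[:1].isupper():
        return prefix + ":" + value
    return prefix + value


# Without a separator, the prefix could end in several places.
print(candidate_prefixes("ACCQ1W2E3"))  # ['A', 'AC', 'ACC', 'ACCQ']

# Terms an indexer would need to add so that queries can match them.
print(index_term("ACC", "Q1W2E3"))  # ACC:Q1W2E3
print(index_term("ACC", "q1w2e3"))  # ACCq1w2e3
print(index_term("Q", "Q1W2E3"))    # QQ1W2E3
```

An indexer that builds its ID terms with the same rule as the query side should then produce terms that the `ACC:Q1W2E3` query can match.
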
