Hi

I get unexpected behavior when I use wildcards in my queries.
I use an EnglishAnalyzer built on SnowballAnalyzer, version 1.1_dev,
from the Lucene in Action lib.

Analysis case:
When I use a wildcard in the middle of a word, the word is not analyzed.
Examples:

QueryParser qp = new QueryParser("body", analyzer);
Query q = qp.parse("ex?mple");
String strq = q.toString();
assertEquals("body:ex?mpl", strq);
//FAIL strq == body:ex?mple

qp = new QueryParser("body", analyzer);
q = qp.parse("ex*ple");
strq = q.toString();
assertEquals("body:ex*pl", strq);
//FAIL strq == body:ex*ple

With this behavior, the search does not find any document.

Bye
Ernesto.

--
Ernesto De Santis - Colaborativa.net
Córdoba 1147 Piso 6 Oficinas 3 y 4
(S2000AWO) Rosario, SF, Argentina.



---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


  • Erik Hatcher at Mar 31, 2005 at 3:52 pm
    Wildcard terms simply are not analyzed. How could it be possible to do
    this? What if I search for "a*" - how could you stem that?

    Erik
  • Ernesto De Santis at Mar 31, 2005 at 5:24 pm
    Hi Erik

    OK, for PrefixQuery, not analyzing is right.

    But do you think not analyzing a WildcardQuery is right?

    You search for "example" and get x results.
    You search for "ex?mple" and get no results.
    Is this correct for you?
    Is it difficult to analyze wildcard queries in the Lucene code?

    Ernesto.


  • Erik Hatcher at Apr 1, 2005 at 1:09 am

    On Mar 31, 2005, at 12:26 PM, Ernesto De Santis wrote:
    > Hi Erik

    Finally, my name spelled correctly..... :))

    > OK, for PrefixQuery, not analyzing is right.
    >
    > But do you think not analyzing a WildcardQuery is right?

    Do I think it's right? That's just the way it is. Whether that is
    right or not I don't know for sure. I don't think analyzing a wildcard
    expression is going to do the right thing in most cases - consider
    analyzers that split on special characters like ? and * - in fact I'd
    bet your analyzer currently does that!

    > You search for "example" and get x results.
    > You search for "ex?mple" and get no results.
    > Is this correct for you?
    > Is it difficult to analyze wildcard queries in the Lucene code?

    You're free to subclass QueryParser and override getWildcardQuery and
    analyze the term text. I suspect you won't have much success though.
    Please let us know what you find.

    Erik
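
    The point about analyzers splitting on ? and * can be seen in a minimal
    sketch that does not use Lucene at all. The class name and the
    letterTokens method below are hypothetical; the method only stands in for
    a letter-based tokenizer, which is an assumption about how such an
    analyzer behaves, not the actual SnowballAnalyzer code.

```java
import java.util.ArrayList;
import java.util.List;

final class WildcardTokenizingDemo {

    // Hypothetical stand-in for a letter-based tokenizer: every run of
    // letters becomes one lowercased token; any other character separates tokens.
    static List<String> letterTokens(String text) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (Character.isLetter(c)) {
                current.append(Character.toLowerCase(c));
            } else if (current.length() > 0) {
                tokens.add(current.toString());
                current.setLength(0);
            }
        }
        if (current.length() > 0) {
            tokens.add(current.toString());
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(letterTokens("example")); // [example]
        System.out.println(letterTokens("ex?mple")); // [ex, mple]
        System.out.println(letterTokens("ex*ple"));  // [ex, ple]
    }
}
```

    Under this assumption the wildcard character is discarded and the term
    falls apart into two unrelated tokens, so running the analyzer over the
    raw wildcard expression would not produce "ex?mpl".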


  • Morus Walter at Apr 1, 2005 at 6:23 am

    Ernesto De Santis writes:
    > Hi Erik
    >
    > OK, for PrefixQuery, not analyzing is right.

    It creates the same problems.
    'example*' should find 'example' but does not if 'example' is stemmed
    to 'exampl' and you don't analyze the prefix query.

    > You search for "example" and get x results.
    > You search for "ex?mple" and get no results.
    > Is this correct for you?
    > Is it difficult to analyze wildcard queries in the Lucene code?

    This has nothing to do with the Lucene code.
    If you can write such an analyzer, do so. Erik already showed you how
    to integrate it with QP. If you're successful, share the code.
    It looks easy for 'ex?mple', but how would you analyze 'exampl?s'?
    Assuming 'examples' gets stemmed to 'exampl', you would have to guess
    that ? might expand to 'e' and that 'exampl?s' should be analyzed to 'exampl'
    and 'exampl?s' (and probably 'exampl?'). Or 'exa*s'?

    IMO you have to either avoid stemming or wildcards or live with the
    different handling.

    Morus
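
    The mismatch described above can be reproduced with a small
    self-contained sketch. Nothing here is Lucene code: toyStem is a
    deliberately naive, hypothetical stemmer (not the Snowball algorithm),
    and wildcardMatches is an assumed helper that translates ? and * into a
    regular expression and tests it against one indexed term.

```java
import java.util.Locale;
import java.util.regex.Pattern;

final class StemmedWildcardDemo {

    // Toy stemmer, NOT Snowball: strips one trailing "s", then one trailing "e",
    // so "example" and "examples" are both indexed as "exampl".
    static String toyStem(String word) {
        String w = word.toLowerCase(Locale.ROOT);
        if (w.endsWith("s")) {
            w = w.substring(0, w.length() - 1);
        }
        if (w.endsWith("e")) {
            w = w.substring(0, w.length() - 1);
        }
        return w;
    }

    // '?' matches exactly one character, '*' matches any run of characters;
    // every other character is matched literally against the whole term.
    static boolean wildcardMatches(String pattern, String term) {
        StringBuilder regex = new StringBuilder();
        for (char c : pattern.toCharArray()) {
            if (c == '?') {
                regex.append('.');
            } else if (c == '*') {
                regex.append(".*");
            } else {
                regex.append(Pattern.quote(String.valueOf(c)));
            }
        }
        return Pattern.matches(regex.toString(), term);
    }

    public static void main(String[] args) {
        String indexed = toyStem("example");                      // "exampl"
        System.out.println(wildcardMatches("ex?mple", indexed));  // false
        System.out.println(wildcardMatches("example*", indexed)); // false
        System.out.println(wildcardMatches("ex?mpl", indexed));   // true
        System.out.println(wildcardMatches("exampl?s", indexed)); // false
    }
}
```

    The unanalyzed patterns are compared against the stemmed term, so both
    the wildcard and the prefix form miss it; only a pattern that already
    looks like the stem matches.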

  • Sven Duzont at Apr 1, 2005 at 1:09 pm
    Hello Erik,

    Since wildcard queries are not analyzed, how can we deal with accents?
    For instance (in French), a query like "ingé*" will not match documents containing
    "ingénieur", but the query "inge*" will.

    Thanks

    ---
    sven

  • Erik Hatcher at Apr 1, 2005 at 2:11 pm

    On Apr 1, 2005, at 8:09 AM, Sven Duzont wrote:
    > Since wildcard queries are not analyzed, how can we deal with accents?
    > For instance (in French), a query like "ingé*" will not match documents
    > containing "ingénieur", but the query "inge*" will.

    I presume your analyzer normalizes accented characters? Which analyzer
    is that?

    You will need to employ some form of character normalization on
    wildcard queries too.

    Erik
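
    One way to apply such character normalization to a wildcard term is
    sketched below. It uses only the JDK (java.text.Normalizer), not a Lucene
    filter; the class and method names are hypothetical, and the result only
    agrees with the index if the indexing analyzer lowercases and strips
    accents in the same way.

```java
import java.text.Normalizer;
import java.util.Locale;

final class WildcardTermNormalizer {

    // Lowercases the term and strips diacritics; '?' and '*' pass through untouched.
    static String normalize(String wildcardTerm) {
        String decomposed = Normalizer.normalize(
                wildcardTerm.toLowerCase(Locale.ROOT), Normalizer.Form.NFD);
        // After NFD decomposition, accents are separate combining marks (\p{M}).
        return decomposed.replaceAll("\\p{M}+", "");
    }

    public static void main(String[] args) {
        System.out.println(normalize("ing\u00e9*")); // inge*
        System.out.println(normalize("Ex?MPLE"));    // ex?mple
    }
}
```

    Characters that have no canonical decomposition (for example "ø" or "ß")
    are left as they are, so a filter with its own mapping table may behave
    differently for those.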

  • Sven Duzont at Apr 1, 2005 at 4:07 pm
    EH> I presume your analyzer normalized accented characters? Which analyzer
    EH> is that?

    Yes, I'm using a custom analyzer for indexing / searching. It consists
    of:
    - FrenchStopFilter
    - IsoLatinFilter (this is the one that replaces accented
    characters)
    - LowerCaseFilter
    - ApostropheFilter (to handle terms with apostrophes; for instance,
    "l'expérience" is decomposed into two tokens: "l" and "expérience")

    EH> You will need to employ some form of character normalization on
    EH> wildcard queries too.

    Thanks, it works successfully. Code snippet follows.

    ---
    sven

    /*----------------------- CODE ----------------------------*/

    private static Query CreateCustomQuery(Query query)
    {
        if (query instanceof BooleanQuery) {
            final BooleanClause[] bClauses = ((BooleanQuery) query).getClauses();

            // The first clause is required
            if (bClauses[0].prohibited != true)
                bClauses[0].required = true;

            // Will parse each clause to remove accents if needed
            Term term;
            for (int i = 0; i < bClauses.length; i++) {
                if (bClauses[i].query instanceof WildcardQuery) {
                    term = ((WildcardQuery) bClauses[i].query).getTerm();
                    bClauses[i].query = new WildcardQuery(new Term(term.field(),
                        ISOLatin1AccentFilter.RemoveAccents(term.text().toLowerCase())));
                }
                if (bClauses[i].query instanceof PrefixQuery) {
                    term = ((PrefixQuery) bClauses[i].query).getPrefix();
                    bClauses[i].query = new PrefixQuery(new Term(term.field(),
                        ISOLatin1AccentFilter.RemoveAccents(term.text().toLowerCase())));
                    // toLowerCase because the text is lowercased during indexing
                }
            }
        }
        else if (query instanceof WildcardQuery) {
            final Term term = ((WildcardQuery) query).getTerm();
            query = new WildcardQuery(new Term(term.field(),
                ISOLatin1AccentFilter.RemoveAccents(term.text().toLowerCase())));
        }
        else if (query instanceof PrefixQuery) {
            final Term term = ((PrefixQuery) query).getPrefix();
            query = new PrefixQuery(new Term(term.field(),
                ISOLatin1AccentFilter.RemoveAccents(term.text().toLowerCase())));
        }
        return query;
    }

    /*----------------------- END OF CODE ----------------------------*/

  • Erik Hatcher at Apr 1, 2005 at 5:14 pm

    On Apr 1, 2005, at 11:07 AM, Sven Duzont wrote:
    > Yes, I'm using a custom analyzer for indexing / searching. It consists
    > of:
    > - FrenchStopFilter
    > - IsoLatinFilter (this is the one that replaces accented
    > characters)

    Could you share that filter with the community?

    > // The first clause is required
    > if(bClauses[0].prohibited != true)
    >     bClauses[0].required = true;

    Why do you flip the required flag like this?

    > // Will parse each clause to remove accents if needed
    > Term term;
    > for (int i = 0; i < bClauses.length; i++) {
    >     if(bClauses[i].query instanceof WildcardQuery) {
    >         term = ((WildcardQuery)bClauses[i].query).getTerm();
    >         bClauses[i].query = new WildcardQuery(new Term(term.field(),
    >             ISOLatin1AccentFilter.RemoveAccents(term.text().toLowerCase())));
    >     }

    What about handling BooleanQuery's nested within a BooleanQuery?
    You'll need some recursion.

    Erik
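
    The recursion suggested above might look roughly like the following
    sketch. To keep it self-contained it does not use Lucene: Node,
    WildcardTerm and Group are hypothetical stand-ins for Query, a
    wildcard/prefix query and BooleanQuery, and stripAccents is an assumed
    JDK-based replacement for the accent filter mentioned in the thread.

```java
import java.text.Normalizer;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

final class NestedQueryNormalizer {

    // Stand-in query model for illustration only; these are not Lucene classes.
    interface Node {}

    static final class WildcardTerm implements Node { // wildcard or prefix term
        final String field;
        final String text;
        WildcardTerm(String field, String text) {
            this.field = field;
            this.text = text;
        }
    }

    static final class Group implements Node {        // plays the role of a boolean query
        final List<Node> clauses = new ArrayList<>();
        Group add(Node clause) {
            clauses.add(clause);
            return this;
        }
    }

    // Lowercases and removes combining accent marks; wildcards are untouched.
    static String stripAccents(String s) {
        return Normalizer.normalize(s.toLowerCase(Locale.ROOT), Normalizer.Form.NFD)
                .replaceAll("\\p{M}+", "");
    }

    // Rebuilds the tree, normalizing wildcard terms at any nesting depth.
    static Node normalize(Node node) {
        if (node instanceof Group) {
            Group copy = new Group();
            for (Node child : ((Group) node).clauses) {
                copy.add(normalize(child)); // recursion handles nested groups
            }
            return copy;
        }
        if (node instanceof WildcardTerm) {
            WildcardTerm t = (WildcardTerm) node;
            return new WildcardTerm(t.field, stripAccents(t.text));
        }
        return node; // other node types pass through unchanged
    }
}
```

    Because each group delegates to the same method for its children, a
    wildcard term is normalized no matter how deeply it is nested.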


  • Sven Duzont at Apr 2, 2005 at 12:01 pm
    Hello,

    EH> What about handling BooleanQuery's nested within a BooleanQuery?
    EH> You'll need some recursion.
    Thanks for all the hints. I've recoded the method to handle nested
    BooleanQueries.

    EH> Could you share that filter with the community?
    Of course, the code is in the attachment.

    > // The first clause is required
    > if(bClauses[0].prohibited != true)
    >     bClauses[0].required = true;
    EH> Why do you flip the required flag like this?
    On the search interface, near the keyword field, there is a combo box
    with 4 values:
    - KW_MODE_OR : "Search for at least one of the terms"
    - KW_MODE_AND : "Search for all the terms"
    - KW_MODE_PHRASE : "Search for exact phrase"
    - KW_MODE_BOOLEAN : "Search using boolean query" (for advanced users)
    I flip the required flag only when the boolean expression mode is selected.
    It forces the first term to be required, so the user does not
    need to specify the "+" or "AND" operator.
    Maybe there is a more elegant way to do this?
    The code follows.

    Thanks
    ---
    Sven (is not a berserk)

    /*-------------------------------- CODE ---------------------------*/
    // Keywords contained in the CV
    if (cvSearchBean.keywords != null &&
            cvSearchBean.keywords.length() != 0) {
        // "All the keywords" or "At least one of the keywords"
        boolean required = false;
        if ((required = cvSearchBean.keywordModeId == KW_MODE_AND) ||
                cvSearchBean.keywordModeId == KW_MODE_OR) {
            final Query q = CreateCustomQuery(QueryParser.parse(
                cvSearchBean.keywords, FIELD_RESUME_BODY, analyzer));
            if (q instanceof BooleanQuery) {
                final BooleanClause[] terms = ((BooleanQuery) q).getClauses();
                for (int i = 0; i < terms.length; i++) {
                    terms[i].prohibited = false;
                    terms[i].required = required;
                }
            }
            bQuery.add(q, true, false);
        }
        // Exact phrase
        if (cvSearchBean.keywordModeId == KW_MODE_PHRASE) {
            final PhraseQuery q = new PhraseQuery();
            final TokenStream ts = analyzer.tokenStream(FIELD_RESUME_BODY,
                new StringReader(cvSearchBean.keywords));
            Token token;
            while ((token = ts.next()) != null)
                q.add(new Term(FIELD_RESUME_BODY, token.termText()));
            bQuery.add(q, true, false);
        }
        // Boolean expression
        if (cvSearchBean.keywordModeId == KW_MODE_BOOLEAN) {
            final Query query = QueryParser.parse(cvSearchBean.title,
                FIELD_RESUME_BODY, analyzer);
            if (query instanceof BooleanQuery) {
                final BooleanClause[] bClauses =
                    ((BooleanQuery) query).getClauses();
                if (bClauses[0].prohibited != true)
                    bClauses[0].required = true;
            }
            bQuery.add(CreateCustomQuery(query), true, false);
        }
    }

    /*--------------------------END OF CODE --------------------------*/



  • Erik Hatcher at Apr 3, 2005 at 1:22 pm

    On Apr 2, 2005, at 7:01 AM, Sven Duzont wrote:
    > EH> Could you share that filter with the community?
    > Of course, the code is in the attachment.

    Thanks for sharing that!

    Would you be interested in donating that to the contrib area for
    analyzers? The topic of normalizing accented characters has come up
    often lately. I noticed you already put the Apache license at the top
    of the code.

    > EH> Why do you flip the required flag like this?
    > On the search interface, near the keyword field, there is a combo box
    > with 4 values:
    > - KW_MODE_OR : "Search for at least one of the terms"
    > - KW_MODE_AND : "Search for all the terms"
    > - KW_MODE_PHRASE : "Search for exact phrase"
    > - KW_MODE_BOOLEAN : "Search using boolean query" (for advanced users)
    > I flip the required flag only when the boolean expression mode is selected.
    > It forces the first term to be required, so the user does not
    > need to specify the "+" or "AND" operator.
    > Maybe there is a more elegant way to do this?

    When using QueryParser, you can set the default operator, which is
    normally OR. It will handle setting the first (and every) clause
    appropriately. You'll need to create an instance of QueryParser
    to set that flag (see the javadocs for details).

    Erik

  • Sven Duzont at Apr 3, 2005 at 4:32 pm
    EH> Thanks for sharing that!
    EH> Would you be interested in donating that to the contrib area for
    EH> analyzers? The topic of normalizing accented characters has come up
    EH> often lately. I noticed you already put the Apache license at the top
    EH> of the code.

    No problem, it was intended for the sandbox.

    EH> When using QueryParser, you can set the default operator, which is
    EH> normally OR. It will handle setting the first (and every) clause
    EH> appropriately. You'll need to instantiate an instance of QueryParser
    EH> to set that flag (see javadocs for details).

    Yes, that's what I first thought of, but they (the end users) wanted
    all clauses except the first to be handled by the 'OR' operator.
    I'll try to convince them that it will make my (and their) life easier
    if the default operator for all clauses is 'AND' ;)

    Regards,

    Sven


