In reading the documentation for escape characters, I'm having a
little trouble understanding what it wants me to do for certain
special cases.

http://lucene.apache.org/java/docs/queryparsersyntax.html#Escaping%20Special%20Characters
says: "Lucene supports escaping special characters that are part of
the query syntax. The current list special characters are: + - && ||
! ( ) { } [ ] ^ " ~ * ? : \ To escape these character use the \
before the character."

Specifically, I'm curious about the double characters && and || and
how they should be properly escaped.

Experimentation showed some very strange things with the StandardAnalyzer.

Using Luke, I get some interesting mappings.
AT&T becomes at&t (as expected)
AT&&T becomes t (tricky... "at" is now taken as a stop word; fine,
makes sense)

..but what about... "AT&&T" ...nope, still t.

AAA&BBB becomes aaa&bbb ...correct
AAA&&BBB becomes aaa bbb ...ampersand becomes a space?
"AAA&&BBB" is also aaa bbb

AAA\&BBB correctly is aaa&bbb ...just as before
AAA\&&BBB is aaa bbb ...but perhaps we got the escape wrong.

Is '&&' a special "character", and is it escaped as \&& or as
\&\&? ...let's find out.

AAA\&\&BBB is also aaa bbb ...perhaps we need quotes?
"AAA\&\&BBB" is also aaa bbb ...I can't seem to get the escape to work.

How about this?
AAA&BBB&CCC strangely becomes aaa&bbb ccc

Even when escaped?
AAA\&BBB\&CCC is also aaa&bbb ccc ...appears so.

What about...
AAA&BBB&CCC&DDD becomes aaa&bbb ccc&ddd ....whoa, not expecting that.

AAA&&BBB&&CCC&&DDD becomes aaa bbb ccc ddd ...if && means AND, ok...

AAA\&&BBB\&&CCC\&&DDD no change aaa bbb ccc ddd

AAA\&\&BBB\&\&CCC\&\&DDD also no change aaa bbb ccc ddd


It appears I literally cannot search for the token with two ampersands
in it, whether they are touching or not.

Clearly I'm missing something. Is there a way to get any literal
sequence of my choosing, using escapes, as a term in the Lucene
expression?

-Walt Stoneburner

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org
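
The question of whether && should be written as \&& or \&\& can be looked at on its own, before any analyzer is involved. The sketch below is a simplified model, not Lucene code: `escape` and `unescape` are hypothetical helper names, the character set is taken from the documentation quoted above, and the rule "a backslash makes the next character literal" is an assumption about how the parser reads escapes. (Lucene ships its own helper for the first half, `QueryParser.escape`.)

```python
import re

# Characters the quoted documentation lists as special. "&&" and "||" are
# treated one character at a time here, which is an assumption of this sketch.
SPECIAL_CHARS = set('+-&|!(){}[]^"~*?:\\')


def escape(text: str) -> str:
    """Put a backslash in front of every special character in text."""
    return "".join("\\" + ch if ch in SPECIAL_CHARS else ch for ch in text)


def unescape(term: str) -> str:
    """Model of the parser stage: drop each backslash, keep the next char."""
    return re.sub(r"\\(.)", r"\1", term)


if __name__ == "__main__":
    # Both spellings of the escape, plus the one escape() produces.
    for query in (r"AAA\&&BBB", r"AAA\&\&BBB", escape("AAA&&BBB")):
        print(query, "->", unescape(query))
```

Under these assumptions both spellings hand the same literal text, AAA&&BBB, to whatever comes after the parser, so the choice of escape would not explain the results seen in Luke.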


  • Yonik Seeley at Jul 13, 2007 at 3:47 pm
    I just quickly tried some things via the Solr admin interface, and
    everything seems fine.
    I think you are probably confusing what the parser does vs what the
    analyzer does.
    Try your tests with an un-tokenized field to remove that effect.

    -Yonik
  • Mark Miller at Jul 13, 2007 at 6:40 pm
    This is certainly the case. StandardAnalyzer has a regex matcher that
    looks for a possible company name involving an & or an @. The
    QueryParser is escaping the '&' -- all of the effects described are
    standard results of using the StandardAnalyzer. Any double '&&' will
    break the text into separate tokens, but 'sdfdf&dfsdf' will match as a
    company name. Escaping will not affect the matches that StandardAnalyzer
    tries to make; it will just keep the QueryParser from treating the
    escaped character as an operator.

    'sdfdf&dfsdf&sdfd' will match to company name: sdfdf&dfsdf and then
    token: sdfd...the second '&' breaks, the first causes a company match.
    Check out the regex in StandardTokenizer.jj.

    Also, to point out, there is no 'real' literal search in Lucene.
    Anything in quotes gets passed to the Analyzer, so you will get similar
    results whether you use quotes or not.

    - Mark
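
The analyzer behaviour Mark describes can be approximated in a few lines. This is a rough model under stated assumptions, not the actual grammar in StandardTokenizer.jj: the `TOKEN` pattern (letters, one & or @, letters for a "company" token, otherwise a plain alphanumeric run), the `analyze` function name, and the short stop-word list are all illustrative. The input is the term text after the parser has already removed any escapes.

```python
import re

# Small illustrative subset of English stop words (assumption of this sketch).
STOP_WORDS = {"a", "an", "and", "at", "of", "the", "to"}

# First alternative: a "company"-style token such as AT&T or name@host.
# Second alternative: a plain run of letters and digits.
TOKEN = re.compile(r"[A-Za-z]+[&@][A-Za-z]+|[A-Za-z0-9]+")


def analyze(text: str) -> list:
    """Tokenize, lowercase, and drop stop words, in that order."""
    tokens = [t.lower() for t in TOKEN.findall(text)]
    return [t for t in tokens if t not in STOP_WORDS]


if __name__ == "__main__":
    for text in ("AT&T", "AT&&T", "AAA&&BBB", "AAA&BBB&CCC", "AAA&BBB&CCC&DDD"):
        print(text, "->", " ".join(analyze(text)))
```

Even this crude model reproduces the mappings reported from Luke: a single & between letters is kept as part of one token, a doubled & never matches the company pattern and so splits the text, and in AAA&BBB&CCC the first & is consumed by the company match while the second acts as a break. Since quoted text goes through the same step, adding quotes would not change the outcome.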


Discussion Overview
group: java-user
category: lucene
posted: Jul 13, '07 at 3:13p
active: Jul 13, '07 at 6:40p
posts: 3
users: 3
website: lucene.apache.org
