Hello, I'm new here. I've started using dotLucene and I think I need to
change the QueryParser, but the code is complicated enough that I thought I'd
ask whether one of you could point me in the right direction.

In my implementation of Lucene I have the need to store keywords that are of
the form "<key>:<identity>" for example CI:123. Whilst I can store this in
Lucene using Field.Keyword("ID","CI:123") I can't easily look it up by using
QueryParser which I need to do.

Whenever I parse the query ID:CI:123 it is parsed as "ID:ci". I've already
made a small hack so that non-tokenized values are indexed in lowercase, so at
least I can get them back if I use ID:CI\:123. But colons are commonly used and
I really don't want to have to escape them everywhere.

What I want is for the query parser to parse ID:CI:123 as field(ID)
value(CI:123). I understand that the colon is a special character, but it is
only used to delimit fields and values, so it makes sense to react only to the
first colon. The second colon should be treated as part of the text, which the
analyzer could strip out or keep (in my case I'm using a custom analyzer).

Does this make sense? How do I go about changing the QueryParserTokenManager
to achieve this? Perhaps you can point me to some documentation that
describes the code even?

Any help gratefully received!

Thanks,
Gwyn Carwardine


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org


  • Chris Hostetter at Jan 21, 2006 at 6:46 pm
    If you are flexible in the syntax you are willing to support, you can tell
    your users that they need to escape the colons that aren't meant as field
    delimiters...

    ID:CI\:123

    ...alternatively, you can tell them they have to quote values that contain
    colons...

    ID:"CI:123"

    ...then you can avoid the whole painful mess of the parser internals.


    : Date: Sat, 21 Jan 2006 13:10:56 -0000
    : From: Gwyn Carwardine <gwyn@carwardine.net>
    : Reply-To: java-dev@lucene.apache.org
    : To: java-dev@lucene.apache.org
    : Subject: Handling of colons in QueryParserTokenManager

    -Hoss


  • Gwyn Carwardine at Jan 21, 2006 at 7:20 pm
    I don't want the users to have to use escape characters. I'd rather they
    didn't have to use quotes.

    Of course I think someone needs to go into the internals anyway... on 1.4.3
    I get an index out of array bounds error (not a nice parse exception) when
    it tries to parse the following (which it should be able to do):

    ["fred" TO "joe"]

    Maybe this is fixed in 1.9 but I tried it on the www.lucenebook.com search
    assuming that was using a recent version and that generates a server error!

    It's a real shame that the QueryParserTokenManager had no comments put in to
    explain what on earth it's doing!



  • Erik Hatcher at Jan 23, 2006 at 1:39 am

    On Jan 21, 2006, at 2:16 PM, Gwyn Carwardine wrote:
    > Of course I think someone needs to go into the internals anyway...
    > on 1.4.3 I get an index out of array bounds error (not a nice parse
    > exception) when it tries to parse the following (which it should be
    > able to do):
    >
    > ["fred" TO "joe"]
    >
    > Maybe this is fixed in 1.9 but I tried it on the www.lucenebook.com
    > search assuming that was using a recent version and that generates a
    > server error!

    It does not generate a server error on lucenebook.com:

    <http://www.lucenebook.com/search?query=%5B%22fred%22+TO+%22joe%22%5D>

    Maybe you happened to hit the server at some point when there was an
    issue with the server itself (?), but I just tried it and get plenty
    of results.

    > It's a real shame that the QueryParserTokenManager had no comments
    > put in to explain what on earth it's doing!

    Look at QueryParser.jj - that is where the rest is generated from,
    using JavaCC.

    Your subject mentions colons, but your example doesn't. Besides the
    range query example, is there an issue with colons that you want to
    ask about?

    Erik


  • Gwyn Carwardine at Jan 23, 2006 at 11:29 am
    Thanks for your reply Erik (good book by the way)

    It definitely was producing the error. I was very careful to test before I
    posted. But now, as you say, it doesn't do it.

    However, I wonder if I was entering ["Fred" TO "joe"] (note the capital F)
    because that IS still coming back with HTTP 500 error every time.

    http://www.lucenebook.com/search?query=%5B%22Fred%22+TO+%22joe%22%5D

    I thought I did mention colons!

    Format of a query part in Lucene is field:value

    However, if the value itself contains a colon then the
    QueryParserTokenManager seems to truncate the value at that point. I would
    like to change this behaviour; in fact I think it's behaving illogically.
    The Token Manager's job is to parse into field & value; it shouldn't make
    any decisions about the value. That value should get passed intact (complete
    with colons and any other special characters) through to the Analyzer, whose
    job it is to... well... analyse!

    Gwyn


  • Yonik Seeley at Jan 23, 2006 at 3:09 pm

    On 1/23/06, Gwyn Carwardine wrote:
    > the Token Manager's job is to parse into field & value, it shouldn't make
    > any decisions about the value; that value should get passed intact (complete
    > with colons and any other special characters)

    It's more a matter of parsing than philosophy... the parser must make
    decisions about what is part of the field value so it can know where
    it is in the grammar.

    Examples where the field value is just "bar":
    foo:bar^2
    foo:bar~2
    foo:bar baz

    Now your particular case of ':' may be solvable, but the problem in
    general is not. One must escape special characters to avoid
    ambiguity.

    -Yonik

  • Yonik Seeley at Jan 23, 2006 at 3:23 pm
    I just verified the behavior of an embedded ':' and I agree it's a
    problem that needs to be fixed because it currently silently
    truncates.

    foo:bar:baz is parsed as foo:bar
    foo:bar:baz:what is parsed as foo:bar

    The parser should either
    - throw an exception
    - treat ':' (and everything after) as part of the field value

    As Erik pointed out, this would have to be fixed in the grammar: QueryParser.jj
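
The two options above can be illustrated outside Lucene with a small sketch. This is not the QueryParser grammar and not a patch to QueryParser.jj; `FieldClause`, `splitLenient` and `splitStrict` are hypothetical names invented for this illustration, and the code only handles a single bare `field:value` clause (no boosts, fuzzy suffixes, ranges or quoted phrases).

```java
// Hypothetical illustration only: not part of Lucene or dotLucene.
final class FieldClause {
    final String field;
    final String value;

    private FieldClause(String field, String value) {
        this.field = field;
        this.value = value;
    }

    /** Option 2: split at the first ':' and keep everything after it as the value. */
    static FieldClause splitLenient(String clause) {
        int colon = clause.indexOf(':');
        if (colon <= 0 || colon == clause.length() - 1) {
            throw new IllegalArgumentException("expected field:value, got: " + clause);
        }
        return new FieldClause(clause.substring(0, colon), clause.substring(colon + 1));
    }

    /** Option 1: same split, but reject a value containing a further unescaped ':'. */
    static FieldClause splitStrict(String clause) {
        FieldClause parsed = splitLenient(clause);
        String v = parsed.value;
        for (int i = 0; i < v.length(); i++) {
            char c = v.charAt(i);
            if (c == '\\') {
                i++;            // skip the character that the backslash escapes
            } else if (c == ':') {
                throw new IllegalArgumentException("unescaped ':' in value: " + v);
            }
        }
        return parsed;
    }
}
```

With this sketch, `FieldClause.splitLenient("foo:bar:baz")` yields field `foo` and value `bar:baz`, while `FieldClause.splitStrict("foo:bar:baz")` throws instead of silently dropping `:baz`.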

    -Yonik
  • Erik Hatcher at Jan 24, 2006 at 10:29 am

    On Jan 23, 2006, at 6:24 AM, Gwyn Carwardine wrote:
    > It definitely was producing the error. I was very careful to test
    > before I posted. But now, as you say, it doesn't do it.
    >
    > However, I wonder if I was entering ["Fred" TO "joe"] (note the
    > capital F) because that IS still coming back with HTTP 500 error
    > every time.
    >
    > http://www.lucenebook.com/search?query=%5B%22Fred%22+TO+%22joe%22%5D

    Sure enough. Wow - you win the prize for finding a bug. I believe,
    but am not sure yet, that this is due to a TooManyClauses error.

    > Format of a query part in Lucene is field:value
    >
    > However if the value itself contains a colon then the
    > QueryParserTokenManager seems to truncate the value at that point.
    > I would like to change this behaviour, in fact I think it's behaving
    > illogically.. the Token Manager's job is to parse into field & value,
    > it shouldn't make any decisions about the value; that value should
    > get passed intact (complete with colons and any other special
    > characters) through to the Analyzer whose job it is to.. well..
    > analyse!

    QueryParser certainly has issues when it comes to special characters,
    escaping, and analysis. It's a tricky balancing act, and it
    certainly does not apply in non-standard circumstances. Most
    commonly, special characters aren't relevant to a field's value, as
    they are discarded during analysis. You've got a special case and
    need to deal with it uniquely, with QueryParser not being suitable.
    Whether it makes sense to adjust QueryParser or not, I'm not sure -
    it looks like an improvement to colon handling is needed.

    Again, changes to QueryParser occur in QueryParser.jj, not any of
    the .java files that are generated.

    Erik


  • Daniel Naber at Jan 21, 2006 at 8:00 pm

    On Saturday 21 January 2006 19:46, Chris Hostetter wrote:

    > if you are flexible in the syntax you are willing to support, you can
    > tell your users that they need to escape the colons that aren't meant as
    > field identifiers...
    >
    > ID:CI\:123

    Or you could use a regular expression to turn ID:CI:123 into ID:CI\:123
    before the QueryParser is used. Probably simpler than messing with
    QueryParser.
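
A minimal sketch of that pre-processing step, using only `java.util.regex`. The class and method names (`ColonEscaper`, `escapeInnerColons`) are made up for this example, and it assumes terms are separated by whitespace: within each term the first unescaped colon is left alone as the field delimiter and every later unescaped colon gets a backslash. Quoted phrases containing spaces and range queries are not handled, so treat it as a starting point rather than a complete solution.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: run it on the raw query string before handing it to QueryParser.
final class ColonEscaper {
    // A whitespace-delimited term (assumption: no spaces inside quoted phrases).
    private static final Pattern TERM = Pattern.compile("\\S+");
    // A colon that is not already preceded by a backslash.
    private static final Pattern BARE_COLON = Pattern.compile("(?<!\\\\):");

    static String escapeInnerColons(String query) {
        Matcher terms = TERM.matcher(query);
        StringBuffer out = new StringBuffer();
        while (terms.find()) {
            String term = terms.group();
            Matcher first = BARE_COLON.matcher(term);
            if (first.find()) {
                // Keep "field:" as is, escape any remaining bare colons in the value.
                String head = term.substring(0, first.end());
                String tail = BARE_COLON.matcher(term.substring(first.end()))
                                        .replaceAll("\\\\:");
                term = head + tail;
            }
            terms.appendReplacement(out, Matcher.quoteReplacement(term));
        }
        terms.appendTail(out);
        return out.toString();
    }
}
```

For example, `ColonEscaper.escapeInnerColons("ID:CI:123")` returns `ID:CI\:123`, and a query that is already escaped passes through unchanged.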

    Regards
    Daniel

    --
    http://www.danielnaber.de
