Grokbase Groups Lucene dev June 2011
[ https://issues.apache.org/jira/browse/SOLR-1980?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13043248#comment-13043248 ]

Jan Høydahl commented on SOLR-1980:
-----------------------------------

Really, this is the type of feature that should be implemented at the Lucene level, with proper query language support. Any suggestions on how this could be done, perhaps using the positions and #terms metadata from the index instead of inserting special tokens at the beginning and end?
Implement boundary match support
--------------------------------

Key: SOLR-1980
URL: https://issues.apache.org/jira/browse/SOLR-1980
Project: Solr
Issue Type: New Feature
Components: Schema and Analysis
Reporter: Jan Høydahl

Sometimes you need to specify that a query should match only at the start or end of a field, or be an exact match.
Example content:
1) a quick fox is brown
2) quick fox is brown
Example queries:
"^quick fox" -> should only match 2)
"brown$" -> should match 1) and 2)
"^quick fox is brown$" -> should only match 2)
The proposed implementation is a new BoundaryMatchTokenFilter, which behaves like this:
On the index side, it inserts special unique tokens at the beginning and end of the field. These could be some unusual Unicode sequence.
On the query side, it looks for a first character matching "^" or a last character matching "$" and replaces it with the corresponding special token.
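
The behaviour described above can be sketched in plain Java. This is only an illustration of the idea, not the proposed BoundaryMatchTokenFilter and not the Lucene TokenFilter API: the class name, the marker characters (U+E000 and U+E001) and the whitespace tokenization are all assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

/** Plain-Java illustration of boundary tokens; not a Lucene TokenFilter. */
final class BoundaryMatchSketch {
    // Hypothetical marker tokens; the issue only asks for "special unique tokens".
    static final String START = "\uE000";
    static final String END = "\uE001";

    /** Splits on whitespace; an empty or blank string yields no tokens. */
    static List<String> words(String text) {
        String trimmed = text.trim();
        if (trimmed.isEmpty()) {
            return new ArrayList<>();
        }
        return new ArrayList<>(Arrays.asList(trimmed.split("\\s+")));
    }

    /** Index side: wrap the field's tokens in the two boundary markers. */
    static List<String> indexTokens(String fieldValue) {
        List<String> tokens = new ArrayList<>();
        tokens.add(START);
        tokens.addAll(words(fieldValue));
        tokens.add(END);
        return tokens;
    }

    /** Query side: a leading '^' or trailing '$' becomes the matching marker token. */
    static List<String> queryTokens(String query) {
        String q = query.trim();
        boolean atStart = q.startsWith("^");
        if (atStart) {
            q = q.substring(1);
        }
        boolean atEnd = q.endsWith("$");
        if (atEnd) {
            q = q.substring(0, q.length() - 1);
        }
        List<String> tokens = new ArrayList<>();
        if (atStart) {
            tokens.add(START);
        }
        tokens.addAll(words(q));
        if (atEnd) {
            tokens.add(END);
        }
        return tokens;
    }

    /** Phrase match: the query tokens must occur as a contiguous run in the field. */
    static boolean matches(String fieldValue, String query) {
        return Collections.indexOfSubList(indexTokens(fieldValue), queryTokens(query)) >= 0;
    }

    public static void main(String[] args) {
        String[] docs = {"a quick fox is brown", "quick fox is brown"};
        String[] queries = {"^quick fox", "brown$", "^quick fox is brown$"};
        for (String query : queries) {
            for (String doc : docs) {
                System.out.println(query + " vs \"" + doc + "\": " + matches(doc, query));
            }
        }
    }
}
```

Because the markers are ordinary tokens, an anchored query reduces to a normal phrase match, which is why the proposal needs no new query type.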
--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

  • Dawid Weiss (JIRA) at Jun 3, 2011 at 8:42 am
    [ https://issues.apache.org/jira/browse/SOLR-1980?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13043252#comment-13043252 ]

    Dawid Weiss commented on SOLR-1980:
    -----------------------------------

    Isn't this what the (automaton-based) regexp query already does, and does efficiently?
  • Jan Høydahl (JIRA) at Jun 3, 2011 at 9:10 am
    [ https://issues.apache.org/jira/browse/SOLR-1980?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13043261#comment-13043261 ]

    Jan Høydahl commented on SOLR-1980:
    -----------------------------------

    Is this backed by the Lucene Query parser? How would you query q="^quick fox$" with the regex query?
  • Dawid Weiss (JIRA) at Jun 3, 2011 at 10:38 am
    [ https://issues.apache.org/jira/browse/SOLR-1980?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13043290#comment-13043290 ]

    Dawid Weiss commented on SOLR-1980:
    -----------------------------------

    Yep, it should be -- qp.parse("/^quick fox$/"). Peek at TestQueryParser#testRegexps

  • Robert Muir (JIRA) at Jun 3, 2011 at 12:45 pm
    [ https://issues.apache.org/jira/browse/SOLR-1980?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13043332#comment-13043332 ]

    Robert Muir commented on SOLR-1980:
    -----------------------------------

    you just don't need the anchors for this one (it's implied).

    the syntax is here: http://www.brics.dk/automaton/doc/dk/brics/automaton/RegExp.html

    i don't know if this really solves your problems, as you are talking about multiple tokens.

    just remember, users have trouble understanding how wildcards interact with stemming and such, so I don't see regexp queries spanning across multiple tokens (analyzed) anytime soon...
  • Dawid Weiss (JIRA) at Jun 3, 2011 at 12:47 pm
    [ https://issues.apache.org/jira/browse/SOLR-1980?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13043334#comment-13043334 ]

    Dawid Weiss commented on SOLR-1980:
    -----------------------------------

    Right... multiple tokens will be an issue here, didn't think of that.
  • Robert Muir (JIRA) at Jun 3, 2011 at 12:55 pm
    [ https://issues.apache.org/jira/browse/SOLR-1980?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13043338#comment-13043338 ]

    Robert Muir commented on SOLR-1980:
    -----------------------------------

    well, it's fine if you are matching on something really short: you could index with KeywordTokenizer and use this for some use cases.
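
A rough sketch of why keyword-style indexing helps here, assuming nothing about Lucene's own classes: `java.util.regex` stands in for the automaton-based regexp query (its syntax differs from the brics RegExp syntax linked above), and the class and method names are made up for the example. The point it tries to show is that a regexp is matched against a whole term, so the anchors are implied, and that the result depends on whether the field is one term or many.

```java
import java.util.regex.Pattern;

/** Stand-in illustration using java.util.regex; not Lucene's RegexpQuery. */
final class KeywordRegexSketch {
    /** Keyword-style analysis: the whole field value is the single indexed term. */
    static boolean matchesKeywordField(String fieldValue, String regex) {
        // Pattern.matches requires the entire input to match, so "^" and "$" are implied.
        return Pattern.matches(regex, fieldValue);
    }

    /** Whitespace-style analysis: the regex is tried against each term separately. */
    static boolean matchesTokenizedField(String fieldValue, String regex) {
        Pattern pattern = Pattern.compile(regex);
        for (String term : fieldValue.trim().split("\\s+")) {
            if (pattern.matcher(term).matches()) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        String doc1 = "a quick fox is brown";
        String doc2 = "quick fox is brown";
        // Starts-with, ends-with and exact match on a single-term field.
        System.out.println(matchesKeywordField(doc1, "quick fox.*"));
        System.out.println(matchesKeywordField(doc2, "quick fox.*"));
        System.out.println(matchesKeywordField(doc1, ".*brown"));
        System.out.println(matchesKeywordField(doc2, "quick fox is brown"));
        // On a tokenized field no single term contains "quick fox".
        System.out.println(matchesTokenizedField(doc2, "quick fox.*"));
    }
}
```

On the tokenized field a pattern can only ever see one term at a time, which is the multiple-token limitation raised earlier in the thread.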
  • Jan Høydahl (JIRA) at Jun 3, 2011 at 3:28 pm
    [ https://issues.apache.org/jira/browse/SOLR-1980?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13043399#comment-13043399 ]

    Jan Høydahl commented on SOLR-1980:
    -----------------------------------

    I'm sure I can get it working the way I started, using a CharFilter. However, perhaps it's possible to implement this with a more generic, Lucene-like query syntax that uses position info from the index:

    {code}
    title:"quick fox"@N:M
    {code}
    This would mean that the phrase must be anchored between the N'th and M'th token positions in the field. Negative values for N/M would mean relative to the end. Thus "^quick fox$" could be written:
    {code}
    title:"quick fox"@0:-0
    {code}
    Or if you require the phrase to be within first 10 words OR last 10 words:
    {code}
    title:("quick fox"@0:10 OR "quick fox"@-10:-0)
    {code}
    Requiring a term to be exactly @ position 3 would be:
    {code}
    title:fox@3:3
    {code}

    If this syntax is feasible, we could use the same syntax in eDisMax's pf param in order to tell it to add a position constraint when forming the pf part of the query:
    {code}
    pf=title@0:-0
    {code}
    This would only generate a phrase match on title if the phrase is an exact match of the whole field.

    Potential issues with multi-valued fields? Is the field delimiter clearly marked or is it only an increment gap?

    Would it be easy to parse such a syntax and generate a Lucene query with the position constraints?
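
To make the `@N:M` idea concrete, here is a small plain-Java sketch that resolves the two bounds against a tokenized field and checks a phrase against them. Everything in it is an assumption for illustration: positions are taken to be 0-based, a bound with a leading "-" (including "-0") is counted back from the last token, and the class and method names are hypothetical. The examples above appear to rely on two readings of the constraint, so the sketch exposes both: `exact = true` requires the phrase to start at N and end at M (the `@0:-0` and `@3:3` cases), while `exact = false` only requires it to lie inside the window (the "within first 10 words" case).

```java
import java.util.Arrays;
import java.util.List;

/** Plain-Java illustration of the proposed @N:M constraint; not a Lucene query. */
final class PositionConstraintSketch {
    /** Splits on whitespace, standing in for the field's analyzer. */
    static List<String> tokens(String text) {
        return Arrays.asList(text.trim().split("\\s+"));
    }

    /** Resolves one bound; a leading '-' (including "-0") counts back from the last token. */
    static int resolve(String bound, int fieldLength) {
        String b = bound.trim();
        if (b.startsWith("-")) {
            // Assumed meaning: "-0" is the last position, "-1" the one before it, and so on.
            return (fieldLength - 1) - Integer.parseInt(b.substring(1));
        }
        return Integer.parseInt(b);
    }

    /**
     * Checks whether the phrase occurs in the field under the constraint "N:M".
     * exact = true: the phrase must start at N and end at M.
     * exact = false: the phrase must lie anywhere inside the window N..M.
     */
    static boolean matches(List<String> field, List<String> phrase, String constraint, boolean exact) {
        String[] bounds = constraint.split(":");
        if (bounds.length != 2) {
            throw new IllegalArgumentException("expected N:M, got " + constraint);
        }
        int lo = resolve(bounds[0], field.size());
        int hi = resolve(bounds[1], field.size());
        for (int start = 0; start + phrase.size() <= field.size(); start++) {
            if (!field.subList(start, start + phrase.size()).equals(phrase)) {
                continue;
            }
            int end = start + phrase.size() - 1;
            boolean ok = exact ? (start == lo && end == hi) : (start >= lo && end <= hi);
            if (ok) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        List<String> doc1 = tokens("a quick fox is brown");
        List<String> doc2 = tokens("quick fox is brown");
        System.out.println(matches(doc2, tokens("quick fox is brown"), "0:-0", true));
        System.out.println(matches(doc1, tokens("quick fox is brown"), "0:-0", true));
        System.out.println(matches(doc1, tokens("quick fox"), "0:10", false));
    }
}
```

Note that under the window reading `@0:-0` accepts the phrase anywhere in the field, so a real implementation would have to settle which of the two meanings the syntax carries. Multi-valued fields are not modelled here at all.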

Discussion Overview
group: dev @
categories: lucene
posted: Jun 3, '11 at 8:22a
active: Jun 3, '11 at 3:28p
posts: 8
users: 1
website: lucene.apache.org
