  • Mark Ferguson at Jun 25, 2008 at 5:48 pm
    Hello,

    I am currently keeping an index of all our clients' usernames. The search
    functionality is implemented using a PrefixFilter. However, we would like to
    expand the functionality to be able to search any part of a user's name,
    rather than requiring that it begin with the query string. So for example,
    the search term 'mit' would return the username 'smith'.

    I am hesitant to use a WildcardQuery starting with an asterisk because I've
    read about why this is a bad idea. I am looking for suggestions on the best
    way to implement this.

    The idea I've come up with is to index each part of the username; so for
    example, if the username is 'mark', you would index mark, ark, rk, and k.
    Then you could still use the PrefixFilter. I'm not overly concerned about
    how this would enlarge the index because usernames tend to be fairly short.

    I am very much open to other suggestions however. Does anyone have any
    opinions or ideas that they can share?

    Thanks very much.

    Mark
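
The suffix idea in the post above can be sketched without Lucene. This is a minimal illustration, not the poster's implementation: the class name SuffixIndex and its methods are hypothetical, and a TreeMap stands in for the sorted term dictionary that a PrefixFilter would walk. It relies on the fact that every substring of a name is a prefix of one of that name's suffixes.

```java
import java.util.*;

// Hypothetical stand-in for an index whose terms are all suffixes of each username.
class SuffixIndex {
    // suffix -> usernames that contain it; sorted like a term dictionary
    private final TreeMap<String, Set<String>> suffixes = new TreeMap<>();

    // Index "mark" as mark, ark, rk, k.
    public void add(String username) {
        String lower = username.toLowerCase(Locale.ROOT);
        for (int i = 0; i < lower.length(); i++) {
            suffixes.computeIfAbsent(lower.substring(i), k -> new TreeSet<>()).add(username);
        }
    }

    // A substring search becomes a prefix scan over the sorted suffixes:
    // start at the first suffix >= query and stop at the first non-match.
    public Set<String> search(String query) {
        String q = query.toLowerCase(Locale.ROOT);
        Set<String> hits = new TreeSet<>();
        for (Map.Entry<String, Set<String>> e : suffixes.tailMap(q, true).entrySet()) {
            if (!e.getKey().startsWith(q)) {
                break;
            }
            hits.addAll(e.getValue());
        }
        return hits;
    }

    public static void main(String[] args) {
        SuffixIndex index = new SuffixIndex();
        index.add("smith");
        index.add("mark");
        index.add("mitchell");
        System.out.println(index.search("mit")); // prints [mitchell, smith]
    }
}
```

A name of length n contributes n terms, which matches the poster's expectation that short usernames keep the index growth modest.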
  • Erick Erickson at Jun 25, 2008 at 6:26 pm
    Warning: I don't understand ngrams at all, so you should
    read this as a plea for those who do to tell me I'm off base <G>.


    But I wonder if indexing as n-grams would be a way to
    cope with this issue that lots of people have. *assuming*
    you are thinking about single terms, then it seems that
    "smith" would be tokenized as sm, mi, it, th. Then
    a wildcard search for "mi it" would hit (as a phrase
    query or a SpanQuery with slop of 0). It seems like there
    are several issues to work out here, especially including
    multiple terms, matching mixtures of wildcards and
    non-wildcards, etc.

    But it seems do-able....


    Another approach is to use WildcardTermEnum and/or
    RegexTermEnum to build up a filter and use the filter as
    part of the query. What you lose with this approach is
    that the filter (and wildcards) then don't contribute to
    scoring. But this isn't a huge price to pay...

    Best
    Erick
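
The n-gram idea in the post above can be illustrated in plain Java. This is only a sketch under assumptions: the class BigramPhrase and its methods are hypothetical, no Lucene analyzer or query class is used, and the "phrase with slop 0" is approximated by requiring the query's grams to appear at consecutive positions among the indexed grams.

```java
import java.util.*;

// Hypothetical illustration of bigram tokenization plus an exact phrase match.
class BigramPhrase {
    // Split a term into overlapping n-grams: "smith" with n = 2 gives sm, mi, it, th.
    static List<String> ngrams(String term, int n) {
        List<String> grams = new ArrayList<>();
        for (int i = 0; i + n <= term.length(); i++) {
            grams.add(term.substring(i, i + n));
        }
        return grams;
    }

    // Approximates a phrase match with slop 0: the query grams must occur
    // as a contiguous run inside the indexed grams.
    static boolean phraseMatch(String indexed, String query, int n) {
        List<String> doc = ngrams(indexed, n);
        List<String> q = ngrams(query, n);
        if (q.isEmpty()) {
            return false; // a query shorter than n produces no grams to match
        }
        return Collections.indexOfSubList(doc, q) >= 0;
    }

    public static void main(String[] args) {
        System.out.println(ngrams("smith", 2));              // prints [sm, mi, it, th]
        System.out.println(phraseMatch("smith", "mit", 2));  // prints true
        System.out.println(phraseMatch("smith", "mth", 2));  // prints false
    }
}
```

One limitation visible in the sketch: a query shorter than the gram size produces no grams, so single-character searches would need separate handling.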
  • Mark Ferguson at Jun 27, 2008 at 3:58 pm
    Hi Erick,

    Thanks for the suggestions. I've used indexed n-grams before to implement
    spell-checking; I think in this case I may take a look at WildcardTermEnum
    and RegexTermEnum. It seems like a good solution because I am doing my own
    results ordering so Lucene's scoring is irrelevant in this case. I wasn't
    aware of these classes so thanks for mentioning them!

    Best,

    Mark

  • Chris Hostetter at Jun 28, 2008 at 12:33 am
    : Thanks for the suggestions. I've used indexed n-grams before to implement
    : spell-checking; I think in this case I may take a look at WildcardTermEnum
    : and RegexTermEnum. It seems like a good solution because I am doing my own
    : results ordering so Lucene's scoring is irrelevant in this case. I wasn't
    : aware of these classes so thanks for mentioning them!

    Using the enums directly will help you avoid the potential "TooManyClauses"
    exceptions that you would get with a straight WildcardQuery, but it should
    be more efficient to index n-grams and then do a prefix-style search,
    because then you can skipTo(yourTerm) and iterate from there.

    With WildcardTermEnum, if you have a leading wildcard, the TermEnum has to
    "next()" over every term in the field.



    -Hoss
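
The cost difference described above can be made concrete with a small sketch. The class TermScan and both methods are hypothetical, and a sorted TreeSet stands in for the field's term dictionary; tailSet plays the role of seeking to a term, while the leading-wildcard case has no starting point and must visit every term. Each method returns how many terms it examined.

```java
import java.util.*;

// Hypothetical comparison of a seek-based prefix scan and a full term scan.
class TermScan {
    // Prefix search: jump to the first term >= prefix, stop at the first non-match.
    static int prefixScan(NavigableSet<String> terms, String prefix, List<String> out) {
        int visited = 0;
        for (String t : terms.tailSet(prefix, true)) {
            visited++;
            if (!t.startsWith(prefix)) {
                break;
            }
            out.add(t);
        }
        return visited;
    }

    // Leading-wildcard search (*infix*): every term in the field is examined.
    static int containsScan(NavigableSet<String> terms, String infix, List<String> out) {
        int visited = 0;
        for (String t : terms) {
            visited++;
            if (t.contains(infix)) {
                out.add(t);
            }
        }
        return visited;
    }

    public static void main(String[] args) {
        NavigableSet<String> terms = new TreeSet<>(
            Arrays.asList("adams", "baker", "jones", "mitchell", "smith", "young"));
        List<String> prefixHits = new ArrayList<>();
        List<String> infixHits = new ArrayList<>();
        System.out.println(prefixScan(terms, "mit", prefixHits) + " " + prefixHits);  // prints 2 [mitchell]
        System.out.println(containsScan(terms, "mit", infixHits) + " " + infixHits);  // prints 6 [mitchell, smith]
    }
}
```

The prefix scan alone misses "smith" because only whole usernames are in the set here; once suffixes or n-grams are indexed as terms, the same cheap seek-and-iterate loop can find infix matches too.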



Discussion Overview
group: java-user @
categories: lucene
posted: Jun 25, '08 at 5:43p
active: Jun 28, '08 at 12:33a
posts: 5
users: 3
website: lucene.apache.org
