substring indexing to avoid 'TooManyClauses' exception

Hardy Ferentschik
Nov 12, 2007 at 9:45 pm
Hi,

I have a question regarding the way I got around the 'TooManyClauses'
exception when using wildcard queries
(http://wiki.apache.org/lucene-java/LuceneFAQ#head-06fafb5d19e786a50fb3dfb8821a6af9f37aa831).


I am using Lucene in conjunction with Hibernate Search
(http://www.hibernate.org/410.html). I am indexing 'Company' objects
which contain multiple attributes, and the application supports different
types of searches.

One type of search is a right-hand truncated (wildcard query) search of
the company name. If, e.g., the user searches for 'M', I initially
constructed an 'M*' query. I have about 250,000 companies in the index.
Without any modifications I get the 'TooManyClauses' exception, and I
initially kept increasing 'maxClauseCount'. It works, but performance was
terrible. I haven't tried working with a filter, but instead decided to try
a different approach: I index all leading substrings (prefixes) of a
string, e.g. 'Foo' would be indexed as 'F', 'Fo' and 'Foo'.
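
A minimal sketch of the prefix expansion described above, in plain Java. The class and method names (PrefixTerms, prefixes, indexTerms) are hypothetical and are not part of Lucene or Hibernate Search. Splitting the name on whitespace is also an assumption, since the post does not say how multi-word names are tokenized.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper for illustration; not part of Lucene or Hibernate Search. */
final class PrefixTerms {

    /** Returns every leading substring of the word, shortest first. */
    static List<String> prefixes(String word) {
        List<String> result = new ArrayList<>();
        for (int end = 1; end <= word.length(); end++) {
            result.add(word.substring(0, end));
        }
        return result;
    }

    /** Splits a name on whitespace (an assumption) and expands each word into its prefixes. */
    static List<String> indexTerms(String name) {
        List<String> terms = new ArrayList<>();
        for (String word : name.trim().split("\\s+")) {
            if (!word.isEmpty()) {
                terms.addAll(prefixes(word));
            }
        }
        return terms;
    }

    public static void main(String[] args) {
        System.out.println(prefixes("Foo")); // prints [F, Fo, Foo]
    }
}
```

With terms indexed this way, a search for 'M' can be an exact lookup of the single term 'M' rather than a wildcard query that is expanded into one clause per matching term.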

I got rid of the 'TooManyClauses' exception and performance improved by
an order of magnitude, but I would like to get some feedback from other
users on whether this is a good approach or not.

Of course the index size increased, but that was no issue in this case.
Are there any potential problems with ranking/scoring?

Thanks for any feedback.

--Hardy


--
Hartmut Ferentschik
Ekholmsv.339 ,1, 127 45 Skärholmen, Sweden
Phone: +46 855 923 676 (h); +46 704 225 097 (m)

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org

3 responses

  • Erick Erickson at Nov 13, 2007 at 3:13 pm
    Hardy:

    I'm certainly not an expert on ranking and scoring, but I've got to assume
    that this approach influences scoring.

    Another issue is how you indexed multiple values. If you took a hint from
    the SynonymAnalyzer example in Lucene in Action, and indexed all the
    substrings with a position increment of 0, you're probably OK with phrase
    and span queries. Consider indexing the following: "anne gables". You'd
    index a, an, ann and anne followed by g, ga, gab, gabl, gable, gables.
    If the increments for the variants of anne were the default of 1, searching
    for the phrase "anne gables"~2 would fail because anne is really 5 or so
    terms away from gables. You can fix this by ensuring that the increment
    is 0 for the variants, but you need to be aware of this.
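
The effect of the increment can be simulated in plain Java. This is only a sketch: PositionSketch and its positions method are hypothetical and do not call the Lucene API. In Lucene itself the increment would be set on each token inside a custom TokenStream. The sketch assumes whitespace tokenization and that no prefix occurs twice in the input.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical simulation of token positions; it does not call the Lucene API. */
final class PositionSketch {

    /**
     * Assigns a position to every prefix of every word.
     * variantIncrement applies to the 2nd..nth prefix of a word:
     * 1 mimics the default, 0 stacks all prefixes of a word on one position.
     * Repeated prefixes would overwrite each other; fine for this illustration.
     */
    static Map<String, Integer> positions(String text, int variantIncrement) {
        Map<String, Integer> positionOf = new LinkedHashMap<>();
        int position = -1;
        for (String word : text.trim().split("\\s+")) {
            for (int end = 1; end <= word.length(); end++) {
                // The first prefix of a word always advances by one position.
                position += (end == 1) ? 1 : variantIncrement;
                positionOf.put(word.substring(0, end), position);
            }
        }
        return positionOf;
    }

    public static void main(String[] args) {
        Map<String, Integer> defaults = positions("anne gables", 1);
        Map<String, Integer> stacked = positions("anne gables", 0);
        System.out.println(defaults.get("gables") - defaults.get("anne")); // prints 6
        System.out.println(stacked.get("gables") - stacked.get("anne"));   // prints 1
    }
}
```

With an increment of 1 the two full words end up six positions apart, which is why a small slop no longer matches. With an increment of 0 they are adjacent, as in the original text.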

    What about false hits? I don't know enough about your problem
    space to know whether matching on "ann AND gab" in the
    above example would be acceptable (note: no wild cards,
    so the user might be left scratching her head wondering why
    she got a match).

    There are several approaches. There is a thread titled "I just don't
    understand wildcards at all" that has a bunch of information about
    wildcards, and searching the archive for "wildcards" will turn up a
    wealth of information.

    But here are a couple of approaches:
    1> use filters. This has worked quite well for me where I construct
    the filter with WildcardTermEnum. If you are likely to re-use
    these filters, you can always use CachingWrapperFilter (?) to keep
    them around.

    1a> you could pre-compute, say, one filter for each of the
    one-letter possibilities (e.g. a filter for a*, one for b*, etc.)
    and use those filters for the pathological case of one leading
    character, and wildcard queries when there are two or more leading
    characters. This assumes that wildcard performance is acceptable
    for the two-leading-character case.

    2> ask your users whether there's really much value in getting
    so many hits. That is, if you can require a minimum number of
    leading characters, say, two (e.g. ab*), wildcarding might
    work acceptably out of the box. I think a legitimate question
    is "is 10,000 matching terms a useful thing to allow?" Of course
    the immediate response is "yes", but challenging the person
    to produce an actual use case often results in the
    realization that catching the "too many clauses" error and
    responding with a message of "your search is too broad to
    be useful" is reasonable.
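
The filter idea in 1> and 1a> can be sketched in plain Java as one bit per document, set when any term of the document starts with the prefix. Everything here is a hypothetical stand-in: PrefixFilterSketch, addTerm and filterFor are not Lucene classes or methods, and the sorted map only imitates a term dictionary. The cache plays the role that a cached filter would play for frequently used prefixes such as single letters.

```java
import java.util.BitSet;
import java.util.HashMap;
import java.util.Map;
import java.util.SortedMap;
import java.util.TreeMap;

/** Hypothetical in-memory stand-in for a term dictionary; not the Lucene Filter API. */
final class PrefixFilterSketch {

    // Sorted term -> ids of the documents containing that term.
    private final SortedMap<String, int[]> postings = new TreeMap<>();
    // Already computed prefix bit sets, kept for re-use.
    private final Map<String, BitSet> cache = new HashMap<>();

    void addTerm(String term, int... docIds) {
        postings.put(term, docIds);
        cache.clear(); // cached bit sets are stale once the terms change
    }

    /** Returns one bit per document; a set bit means some term starts with the prefix. */
    BitSet filterFor(String prefix) {
        return cache.computeIfAbsent(prefix, p -> {
            BitSet bits = new BitSet();
            // tailMap starts at the first term >= prefix; stop at the first non-match.
            for (Map.Entry<String, int[]> entry : postings.tailMap(p).entrySet()) {
                if (!entry.getKey().startsWith(p)) {
                    break;
                }
                for (int docId : entry.getValue()) {
                    bits.set(docId);
                }
            }
            return bits;
        });
    }

    public static void main(String[] args) {
        PrefixFilterSketch index = new PrefixFilterSketch();
        index.addTerm("mango", 0, 3);
        index.addTerm("maple", 1);
        index.addTerm("nordic", 2);
        System.out.println(index.filterFor("m")); // prints {0, 1, 3}
    }
}
```

However many terms share the prefix, the result is a single bit set over documents, so no per-term query clauses are created and the clause limit is never reached.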

    Best
    Erick
  • Hardy Ferentschik at Nov 14, 2007 at 9:26 pm

    On Tue, 13 Nov 2007 16:12:26 +0100, Erick Erickson wrote:

    Thanks for your help.

    > I'm certainly not an expert on ranking and scoring, but I've got to
    > assume that this approach influences scoring.

    No doubt. The question is whether it matters for this particular use case.
    For this particular field I will only ever have a simple right-hand
    truncated search. The user cannot use span or phrase queries against this
    field, not even an explicit AND. I don't think this approach makes much
    sense when indexing a whole block of text. I only want to use it for
    indexing a simple name which consists of a few words at most. I guess what
    I want to do here is comparable to a single-column SQL LIKE query, e.g.
    SELECT * FROM COMPANY WHERE COMPANY.NAME LIKE 'M%'. Of course this is only
    the simple case. There are other queries where I combine the name search
    with other fields which are indexed using, for example, a SnowballAnalyzer.

    > There are several approaches. There is a thread titled "I just don't
    > understand wildcards at all" that has a bunch of information about
    > wildcards, and searching the archive for "wildcards" will turn up a
    > wealth of information.

    Great. I will look into it.

    Thanks again.

    -- Hardy
  • Erick Erickson at Nov 14, 2007 at 9:53 pm
    Hardy:

    Since your use case is so restricted, I'd recommend that you
    just construct a filter. I think you'll find it's much faster than
    you'd think at first glance. Of course, "your mileage may
    vary". Is there any equivalent phrase like "your kilometerage
    may vary" <G>?

    Most of the discussion in the archives has to do with the
    more general case, so much of it probably doesn't apply to
    your specific case.

    Best
    Erick
