Hi,

I am trying to implement a "progressive search" with Lucene. What I mean is
something like what Google does: you type a few letters and Google searches for
matches as you type. The more letters you enter, the more precise your search
becomes.

I decided to use a prefix query because otherwise, I need to have complete words
in order for multiword queries to work.

I am using Compass as my Lucene frontend.

My query looks like this:

BooleanQuery bq = new BooleanQuery();

// The last word is a prefix
PrefixQuery pq =
    new PrefixQuery(new Term("searchField", words[words.length - 1]));
bq.add(pq, BooleanClause.Occur.MUST);

// All others are normal terms
for (int i = 0; i < words.length - 1; i++) {
    TermQuery tq = new TermQuery(new Term("searchField", words[i]));
    bq.add(tq, BooleanClause.Occur.MUST);
}

The problem I have is that if I specify "little fa" as search terms, Lucene will
match

The little fairy

but also

Farris little
The little pig farmer
Chicken Little: looking far ahead

(each line is the content of a separate document)

I only want the first type of matches.

What should I be doing instead?

Thanks,

L


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


  • Ahmet Arslan at Jan 5, 2011 at 4:59 pm

    So the order of the search terms is important to you. Since you are constructing your queries programmatically, you can use the SpanQuery family.

    If you replace PrefixQuery with SpanRegexQuery and TermQuery with SpanTermQuery, and combine them in an ordered SpanNearQuery, I think you can achieve what you want.
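For illustration, here is a minimal plain-Java sketch of the behaviour an ordered, zero-slop span match with a trailing prefix is meant to give. It does not use Lucene at all: the class `OrderedPrefixMatch`, its methods, and the tokenization (lowercase, strip punctuation, split on whitespace) are assumptions made for this example, not the analyzer or query classes discussed in the thread.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Locale;

/** Plain-Java model (not Lucene) of an ordered, adjacent match whose last word is a prefix. */
final class OrderedPrefixMatch {

    /** Assumed analysis: lowercase, drop punctuation, split on whitespace. */
    static List<String> tokenize(String text) {
        String cleaned = text.toLowerCase(Locale.ROOT)
                .replaceAll("[^\\p{L}\\p{Nd}\\s]", "")
                .trim();
        if (cleaned.isEmpty()) {
            return Collections.emptyList();
        }
        return Arrays.asList(cleaned.split("\\s+"));
    }

    /** True if the typed words occur consecutively and in order, the last one as a prefix. */
    static boolean matches(String document, String typed) {
        List<String> doc = tokenize(document);
        List<String> query = tokenize(typed);
        if (query.isEmpty()) {
            return false;
        }
        int last = query.size() - 1;
        for (int start = 0; start + query.size() <= doc.size(); start++) {
            boolean ok = true;
            // All words except the last must match exactly, at consecutive positions.
            for (int i = 0; i < last && ok; i++) {
                ok = doc.get(start + i).equals(query.get(i));
            }
            // The last typed word only has to be a prefix of the next document word.
            if (ok && doc.get(start + last).startsWith(query.get(last))) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        String[] docs = {
            "The little fairy",
            "Farris little",
            "The little pig farmer",
            "Chicken Little: looking far ahead"
        };
        for (String d : docs) {
            System.out.println(matches(d, "little fa") + "  " + d);
        }
    }
}
```

Under these assumptions only "The little fairy" is accepted for the input "little fa", because "little" must be immediately followed by a word starting with "fa".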




  • L Duperval at Jan 5, 2011 at 6:05 pm
    Ahmet,

    Ahmet Arslan <iorixxx <at> yahoo.com> writes:
    > So the order of the search terms is important to you. Since you are
    > constructing your queries programmatically, you can use the SpanQuery family.

    Yes, order is important.

    > If you replace PrefixQuery with SpanRegexQuery and TermQuery with
    > SpanTermQuery, and combine them in an ordered SpanNearQuery, I think you
    > can achieve what you want.

    I'll look at the documentation to see how to implement it effectively.

    Thanks,

    L


  • Philip Puffinburger at Jan 5, 2011 at 5:05 pm
    We do something similar with a PrefixQuery, but we run the PrefixQuery against a Keyword field.

    So if we had a Book with a title like 'The Brown Dog', we would end up with fields in the document like:

    Used for the normal full text searching

    title : the brown dog

    Used for the prefix searching

    titlekeyword : the brown dog
    titlekeyword : brown dog

    So as the user is typing we are looking up using a PrefixQuery against the titlekeyword field. We had tried things like span queries against the title field before settling on this approach (we also use this field for other things, not just for the PrefixQuery).

  • L Duperval at Jan 5, 2011 at 6:01 pm
    Philip,

    Philip Puffinburger <ppuffinburger <at> tlcdelivers.com> writes:
    > So if we had a Book with a title like 'The Brown Dog', we would end up
    > with fields in the document like:
    >
    > Used for the normal full text searching
    >
    > title : the brown dog
    >
    > Used for the prefix searching
    >
    > titlekeyword : the brown dog
    > titlekeyword : brown dog

    I also have two fields, one for indexing and another for display. How does the
    above affect searching? If you type "brown do" will it find the title correctly
    or do you have to type "brown dog" in order to get a match? Would "brown do"
    match "The brown horse has a dog" or not? My understanding is that Lucene
    (BTW, I'm using 2.4.1 because it's the latest version to work with Compass)
    matches the prefix first, and then combines the matching results with other
    clauses as specified.

    > So as the user is typing we are looking up using a PrefixQuery against the
    > titlekeyword field. We had tried things like span queries against the title
    > field before settling on this approach (we also use this field for other
    > things, not just for the PrefixQuery).

    Span queries are what I was planning to look at next. Why did you choose not
    to use that approach? Is it because of the other things you want to do with
    those fields or something about the way the SpanQuery classes work?

    If you are at liberty to share part of your code I'd appreciate it.

    Thanks,

    L






  • Philip Puffinburger at Jan 5, 2011 at 7:02 pm

    On Jan 5, 2011, at 1:00 PM, L Duperval wrote:

    > Philip,
    >
    > I also have two fields, one for indexing and another for display. How does the
    > above affect searching? If you type "brown do" will it find the title correctly
    > or do you have to type "brown dog" in order to get a match? Would "brown do"
    > match "The brown horse has a dog" or not? My understanding is that Lucene
    > (BTW, I'm using 2.4.1 because it's the latest version to work with Compass)
    > matches the prefix first, and then combines the matching results with other
    > clauses as specified.

    No. Typing "brown do" will match the "brown dog" term but not the "the brown dog" term; because we index both, we don't care which way the user types it. In our system "brown do" will not match "the brown horse has a dog". We only do the PrefixQuery, which runs against the keyword field ("brown dog" is a single term, as is "the brown dog"). We don't have a BooleanQuery like you do, but I don't see why it wouldn't work.

    We basically have a method that looks something like

    List<Book> getBooksBeginningWithTitle(String prefix);

    and that code looks something like (we use Hibernate Search and not Compass, but they are pretty similar) :

    FullTextSession fullTextSession = Search.getFullTextSession(getSession());
    // Normalize the typed prefix the same way the keyword field is normalized at index time
    PrefixQuery prefixQuery = new PrefixQuery(new Term("titlekeyword",
        TextNormalizationUtil.transformKeyword(prefix, LetterCaseTransform.Lower)));
    FullTextQuery ftQuery = fullTextSession.createFullTextQuery(prefixQuery, Book.class);
    return ftQuery.list();



    The field creation for the keyword fields looks like this (done in a Hibernate Search construct called a FieldBridge; I can't remember if Compass has something similar):

    document.add(new Field("titlekeyword",
        TextNormalizationUtil.transformKeyword(fullTitle, LetterCaseTransform.Lower),
        Store.NO, Index.NOT_ANALYZED_NO_NORMS));
    document.add(new Field("titlekeyword",
        TextNormalizationUtil.transformKeyword(partialTitle, LetterCaseTransform.Lower),
        Store.NO, Index.NOT_ANALYZED_NO_NORMS));

    The partialTitle is just the full title with leading articles removed ('A', 'An', 'The', 'L'', etc).

    The TextNormalizationUtil.transformKeyword in this case removes punctuation and non-spacing marks from the text and then lowercases it. This is a business decision: in a keyword the case matters, and users might not type the punctuation or might have the Caps Lock key on, so we normalize things down. You have to be sure that the same normalization happens at index time and at search time.
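As a rough illustration of the keyword-field idea described above, the following plain-Java sketch builds the two keyword values for a title and checks a typed prefix against them. It is not the poster's `TextNormalizationUtil` and uses neither Lucene nor Hibernate Search: the class `TitleKeywords`, its methods, and the English-only article list are assumptions made for this example.

```java
import java.text.Normalizer;
import java.util.LinkedHashSet;
import java.util.Locale;
import java.util.Set;

/** Plain-Java model of the keyword values indexed for one title (hypothetical helper). */
final class TitleKeywords {

    // Assumption: only English leading articles are stripped in this sketch.
    private static final String LEADING_ARTICLE = "^(a|an|the)\\s+";

    /** Removes non-spacing marks and punctuation, lowercases, and collapses whitespace. */
    static String normalize(String text) {
        String decomposed = Normalizer.normalize(text, Normalizer.Form.NFD);
        return decomposed
                .replaceAll("\\p{Mn}", "")   // non-spacing marks (accents after decomposition)
                .replaceAll("\\p{P}", "")    // punctuation
                .toLowerCase(Locale.ROOT)
                .replaceAll("\\s+", " ")
                .trim();
    }

    /** The normalized full title plus the same title without a leading article. */
    static Set<String> keywords(String title) {
        Set<String> terms = new LinkedHashSet<>();
        String full = normalize(title);
        terms.add(full);
        terms.add(full.replaceFirst(LEADING_ARTICLE, ""));
        return terms;
    }

    /** Models a prefix lookup: the typed text gets the same normalization as the index. */
    static boolean prefixMatches(String title, String typed) {
        String prefix = normalize(typed);
        for (String term : keywords(title)) {
            if (term.startsWith(prefix)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        System.out.println(keywords("The Brown Dog"));
        System.out.println(prefixMatches("The Brown Dog", "brown do"));
        System.out.println(prefixMatches("The brown horse has a dog", "brown do"));
    }
}
```

Because each keyword value is a single untokenized term in this model, a prefix can only match from the start of the title (with or without its leading article), which is the behaviour described for "brown do".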


    > That's what I was planning to look at next. Why did you choose not to use this
    > approach? Is it because of the other things you want to do with those fields or
    > something about the way the SpanQuery classes work?

    I needed the field for other things, and the code to do the PrefixQuery against this field was pretty simple.

    We use SpanQuerys (well, a list of SpanRegexQuery clauses fed into a SpanNearQuery) when we do something similar with authors. A user can type an author name in first/last or last/first order, and then there are any additional parts of the name to consider, which means we would have had to create a lot of keyword fields to handle all the combinations and would still have missed some.



  • L Duperval at Jan 5, 2011 at 8:07 pm

    Philip Puffinburger <ppuffinburger <at> tlcdelivers.com> writes:
    > We only do the PrefixQuery which is against the keyword field ("brown dog"
    > is a single term as is "the brown dog"). We don't have a BooleanQuery
    > like you do, but I don't see why it wouldn't work.

    Ahh. OK, so you probably aren't using a whitespace analyzer like we are. We
    chose whitespace because we wanted to be able to search for multiple words, no
    matter where they occurred in the text. That way, we could (wanted to?) match
    "brown dog" with "the brown dog" or "the horse has a brown dog". We had thought
    of breaking up our data into multiple pieces like you are doing but were worried
    about memory and performance (we're storing the index in RAM). I'll think about
    this.

    Thanks for all the information. I'll do some testing on my end to see if I can
    do better than what I've got. I'll also possibly have to rethink some of our
    features (i.e. matching from the start of the title instead of matching
    anywhere as we are currently doing).

    Thanks for your generosity,

    L




  • Cameron Leach at Jan 5, 2011 at 9:49 pm
    L -

    I faced the exact same problem you're having. I ended up writing a custom
    Analyzer to tokenize the terms the way I wanted. If memory serves me
    correctly, the WhitespaceAnalyzer will do this:

    "the brown dog" ->
    the
    brown
    dog

    I think what you want is something like this:

    "the brown dog" ->
    the brown dog
    brown dog
    dog

    If you write your custom analyzer accordingly, to trim terms from the
    beginning and then use the NGramTokenFilter, you should get your real-time
    search results back the way you expect. A small caveat is that spans won't
    work here (e.g. 'the do' won't match 'the brown dog'), which might be what
    you want. I wasn't ever able to figure out a way to do this with
    WhitespaceAnalyzer and a tricky query.
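To make the tokenization above concrete, here is a small plain-Java sketch that produces the word-suffix tokens for a piece of text and then checks a typed prefix against them. It is not a Lucene Analyzer and does not use NGramTokenFilter: the class `SuffixTokens` and its methods are hypothetical, and lowercasing is the only normalization assumed.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

/** Plain-Java model of word-suffix tokens: "the brown dog" -> the brown dog, brown dog, dog. */
final class SuffixTokens {

    /** Returns one token per word position: the text from that word to the end. */
    static List<String> suffixes(String text) {
        List<String> tokens = new ArrayList<>();
        String trimmed = text.toLowerCase(Locale.ROOT).trim();
        if (trimmed.isEmpty()) {
            return tokens;
        }
        String[] words = trimmed.split("\\s+");
        for (int start = 0; start < words.length; start++) {
            tokens.add(String.join(" ", Arrays.copyOfRange(words, start, words.length)));
        }
        return tokens;
    }

    /** Models a prefix lookup against the suffix tokens of one document. */
    static boolean prefixMatches(String text, String typed) {
        String prefix = typed.toLowerCase(Locale.ROOT).trim();
        if (prefix.isEmpty()) {
            return false;
        }
        for (String token : suffixes(text)) {
            if (token.startsWith(prefix)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        System.out.println(suffixes("the brown dog"));
        System.out.println(prefixMatches("the horse has a brown dog", "brown do"));
        System.out.println(prefixMatches("the brown dog", "the do"));
    }
}
```

In this model the number of tokens equals the number of words, but each token repeats the tail of the text, so the total number of indexed characters grows roughly quadratically with the word count. For short titles that may be acceptable; it is worth measuring if the index is held in RAM.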

    Hope that helps a little.

  • L Duperval at Jan 6, 2011 at 1:45 pm
    Cameron,

    Cameron Leach <cameron.developer <at> gmail.com> writes:
    > I think what you want is something like this:
    >
    > "the brown dog" ->
    > the brown dog
    > brown dog
    > dog
    >
    > If you write your custom analyzer accordingly, to trim terms from the
    > beginning and then use the NGramTokenFilter, you should get your real-time
    > search results back the way you expect. A small caveat is that spans won't
    > work here (e.g. 'the do' won't match 'the brown dog'), which might be what
    > you want.

    Thanks, that's another possible approach. I have a few that I need to sort
    through and test out. I also need to take into account performance and memory
    usage. I have to index about 1M small documents in RAM so if additional
    tokenizing is anything more than linear, I may have to rethink this.

    Thanks,

    L




Discussion Overview
group: java-user
categories: lucene
posted: Jan 5, '11 at 4:39p
active: Jan 6, '11 at 1:45p
posts: 9
users: 4
website: lucene.apache.org
