For those with the luxury of a large store of historical queries, it's interesting to note Google's approach to this.

Not some fancy spell checker - just mining searcher behaviour patterns.
Google's Bosworth describes this approach about 13 minutes into this podcast:

http://www.itconversations.com/shows/detail571.html



----- Original Message ----
From: Van Nguyen <vnguyen@ur.com>
To: java-user@lucene.apache.org
Sent: Monday, 12 June, 2006 11:09:20 PM
Subject: RE: question with spellchecker

I'll experiment with both.

Thanks...

-----Original Message-----
From: mark harwood
Sent: Wednesday, June 07, 2006 2:16 AM
To: java-user@lucene.apache.org
Subject: Re: question with spellchecker

I think the problem in your particular example is that the
suggestion software has no consideration of context.

I've been playing with context-sensitive suggestions
recently which take a bunch of validated (i.e. existing)
words (e.g. "tape") and use them to help shortlist
alternatives for an unknown or partially typed word
(e.g. "ducted").

This has potential applications in spell checking and
as-you-type query completion.

The approach is quite simple but effective. You use
your choice of code to produce a list of candidate
terms (e.g. FuzzyTermEnum, some form of Soundex, or
PrefixQuery), then take the large list of candidate
terms produced and compare their usage in relation to
the context of words you already know, e.g. "tape".
In practice this means that TermDocs for the candidate
term are used to construct a doc bitset, which is
compared with a doc bitset produced from all the other
terms which make up the context. The level of
intersection between these bitsets can be used to help
sensibly rank the "duct" and "ducked" candidates in
relation to "tape": do they co-occur often?

[pseudo code]
BitSet contextDocs = matchKnownTerms();
float numContextMatches = contextDocs.cardinality();
for each candTerm in candidate terms for the unknown term
{
    BitSet candMatches = createBitset(candTerm);
    float numCandMatches = candMatches.cardinality();
    // and() modifies candMatches in place, so count it before intersecting
    candMatches.and(contextDocs);
    float numSharedMatches = candMatches.cardinality();
    // shared docs divided by docs matching either side
    float contextRelatedness = numSharedMatches /
        ((numCandMatches + numContextMatches) - numSharedMatches);
    // collect candidate terms that have a high combination of
    // contextRelatedness and unknown-term similarity (e.g. low edit distance)
}
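
As a rough, self-contained sketch of the ranking step above (not the implementation mentioned in this mail): plain java.util.BitSet objects stand in for the doc bitsets that would be built from TermDocs, and the class name ContextRanker, its helper methods, and the doc ids are all made up for illustration. It covers only the co-occurrence score; combining it with edit distance is left out.

```java
import java.util.ArrayList;
import java.util.BitSet;
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of Lucene.
final class ContextRanker {

    /** Overlap of two doc-id sets: shared docs / docs in either set. */
    static double contextRelatedness(BitSet candidateDocs, BitSet contextDocs) {
        // BitSet.and() mutates its receiver and returns void, so intersect a copy.
        BitSet shared = (BitSet) candidateDocs.clone();
        shared.and(contextDocs);
        int numShared = shared.cardinality();
        int numEither = candidateDocs.cardinality() + contextDocs.cardinality() - numShared;
        return numEither == 0 ? 0.0 : (double) numShared / numEither;
    }

    /** Candidate terms ordered from most to least related to the context. */
    static List<String> rank(Map<String, BitSet> candidates, BitSet contextDocs) {
        List<String> terms = new ArrayList<>(candidates.keySet());
        terms.sort(Comparator.comparingDouble(
                (String t) -> contextRelatedness(candidates.get(t), contextDocs)).reversed());
        return terms;
    }

    /** Builds a bitset from a list of (made-up) doc ids. */
    static BitSet docs(int... ids) {
        BitSet bits = new BitSet();
        for (int id : ids) {
            bits.set(id);
        }
        return bits;
    }

    public static void main(String[] args) {
        BitSet tape = docs(1, 2, 3, 4);              // docs containing the known word "tape"
        Map<String, BitSet> candidates = new LinkedHashMap<>();
        candidates.put("ducked", docs(4, 9));        // shares 1 doc with "tape"
        candidates.put("duct", docs(1, 2, 3, 7));    // shares 3 docs with "tape"
        System.out.println(rank(candidates, tape));  // prints [duct, ducked]
    }
}
```

With these toy doc ids "duct" scores 3/5 and "ducked" scores 1/5, so "duct" is ranked first.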

There are quite a few optimisations I've added to this
basic pseudo code in my implementation. When I get
some time I'll package this code up and contribute it
but for now this pseudo code may give some pointers
which help to provide a solution.


Cheers,
Mark



---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org

Discussion Overview
group: java-user @
categories: lucene
posted: Jun 7, '06 at 12:45a
active: Jun 13, '06 at 7:57p
posts: 6
users: 5
website: lucene.apache.org