question with spellchecker
I'm implementing a spellchecker in my search and have a question.

After creating the index and spellchecker index, I pass the phrase
"ducted tape" to search (I am expecting "duct tape" back).

I've played around with boosting the prefixes and suffixes, setting the
accuracy, passing in an IndexReader and a field to search on, and setting
'morePopular' to true, but my search never returns "duct tape".

From the SpellChecker class, I see that for the word "ducted", it tries to
find:

start3:duc^2.0 end3:ted gram3:duc gram3:uct gram3:cte gram3:ted
start4:duct^2.0 end4:cted gram4:duct gram4:ucte gram4:cted
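
For reference, those clauses are just the 3- and 4-character substrings of
"ducted", with the start grams boosted. A quick sketch of the decomposition,
which reproduces the gram terms above:

    import java.util.ArrayList;
    import java.util.List;

    class Grams {
        // All substrings of length n, in order of position.
        static List<String> grams(String word, int n) {
            List<String> out = new ArrayList<String>();
            for (int i = 0; i + n <= word.length(); i++) {
                out.add(word.substring(i, i + n));
            }
            return out;
        }
        // grams("ducted", 3) -> [duc, uct, cte, ted]
        // grams("ducted", 4) -> [duct, ucte, cted]
    }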

I checked to see whether the word "duct" even exists in the spellchecker
index, and it does. I also set the number of similar words to return higher
than the number of results the above query produces, so I could see every
term the spellchecker suggests; but "duct" is not among the suggestions at
all.

The list it returns is:

[dotted, coated, ductile, plated, vented, mounted, united, listed,
ductape, reduced]

Does anyone have suggestions as to how to proceed from here?

Van


  • Eks dev at Jun 7, 2006 at 6:01 am
    Try your query like ((ducted^1000 duct~2) +tape),
    or maybe (duct* +tape),
    or, even better, you could try some stemming (a Porter stemmer should get
    rid of these "-ed" suffixes) combined with some of the above.

    If this does not help, have a look at the LingPipe spell checker, as it
    looks like exactly what you need.
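
    A minimal sketch of that first query built programmatically, assuming the
    classic Lucene query API and a placeholder field name "body":

        import org.apache.lucene.index.Term;
        import org.apache.lucene.search.BooleanClause.Occur;
        import org.apache.lucene.search.BooleanQuery;
        import org.apache.lucene.search.FuzzyQuery;
        import org.apache.lucene.search.Query;
        import org.apache.lucene.search.TermQuery;

        class RescueQuery {
            // ((ducted^1000 duct~) +tape): strongly prefer the literal term
            // when it exists, fall back to fuzzy matches near "duct", and
            // always require "tape".
            static Query build() {
                TermQuery exact = new TermQuery(new Term("body", "ducted"));
                exact.setBoost(1000f);

                BooleanQuery alternatives = new BooleanQuery();
                alternatives.add(exact, Occur.SHOULD);
                alternatives.add(new FuzzyQuery(new Term("body", "duct")),
                                 Occur.SHOULD);

                BooleanQuery query = new BooleanQuery();
                query.add(alternatives, Occur.SHOULD);
                query.add(new TermQuery(new Term("body", "tape")),
                          Occur.MUST);
                return query;
            }
        }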

  • Mark harwood at Jun 7, 2006 at 9:16 am
    I think the problem in your particular example is that the suggestion
    software has no notion of context.

    I've been playing with context-sensitive suggestions recently, which take
    a bunch of validated (i.e. existing) words (e.g. "tape") and use them to
    help shortlist alternatives for an unknown or partially typed word
    (e.g. "ducted").

    This has potential applications in spell checking and as-you-type query
    completion.

    The approach is quite simple but effective: you use your choice of code
    to produce a list of candidate terms (e.g. FuzzyTermEnum, some form of
    Soundex, or PrefixQuery), THEN take the large list of candidate terms
    produced and compare their usage against the context of words you already
    know, e.g. "tape". In practice this means the TermDocs for each candidate
    term are used to construct a doc bitset, which is compared with a doc
    bitset produced from all the other terms that make up the context. The
    level of intersection between these bitsets can be used to sensibly rank
    the "duct" and "ducked" candidates in relation to "tape": do they
    co-occur often?

    [pseudo code]
    BitSet contextDocs = matchKnownTerms();
    float numContextMatches = contextDocs.cardinality();
    for (each candidate term for the unknown term) {
        BitSet candMatches = createBitset(candTerm);
        float numCandMatches = candMatches.cardinality();
        candMatches.and(contextDocs);                  // intersect in place
        float numSharedMatches = candMatches.cardinality();
        // Jaccard-style overlap: shared docs over docs in either set
        float contextRelatedness = numSharedMatches /
            (numCandMatches + numContextMatches - numSharedMatches);
        // collect candidate terms that combine high contextRelatedness with
        // high unknown-term similarity (e.g. low edit distance)
    }

    There are quite a few optimisations I've added to this basic pseudo code
    in my implementation. When I get some time I'll package the code up and
    contribute it, but for now the pseudo code may give some pointers towards
    a solution.
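
    A minimal runnable sketch of the bitset step, assuming the classic Lucene
    IndexReader/TermDocs API and a hypothetical "body" field (candidate
    generation and the final combined scoring are left out):

        import java.util.BitSet;

        import org.apache.lucene.index.IndexReader;
        import org.apache.lucene.index.Term;
        import org.apache.lucene.index.TermDocs;

        public class ContextRelatedness {

            // Collect the ids of all docs containing the given term.
            static BitSet docsContaining(IndexReader reader, Term term)
                    throws Exception {
                BitSet bits = new BitSet(reader.maxDoc());
                TermDocs td = reader.termDocs(term);
                while (td.next()) {
                    bits.set(td.doc());
                }
                td.close();
                return bits;
            }

            // Jaccard-style overlap between a candidate term and the docs
            // matched by the known context words.
            static float relatedness(IndexReader reader, Term candidate,
                                     BitSet contextDocs) throws Exception {
                BitSet candDocs = docsContaining(reader, candidate);
                float numCand = candDocs.cardinality();
                float numContext = contextDocs.cardinality();
                candDocs.and(contextDocs);             // intersect in place
                float shared = candDocs.cardinality();
                return shared / (numCand + numContext - shared);
            }
        }

    With contextDocs built from the known word "tape",
    relatedness(reader, new Term("body", "duct"), contextDocs) should rank
    "duct" above "dotted", "coated", etc. whenever it co-occurs with "tape"
    more often.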


    Cheers,
    Mark


  • Van Nguyen at Jun 12, 2006 at 10:05 pm
    I'll experiment with both.

    Thanks...

  • Mark harwood at Jun 13, 2006 at 11:29 am
    For those with the luxury of a large store of historical queries, it's
    interesting to note Google's approach to this.

    Not some fancy spell checker - just mining searcher behaviour patterns.
    Google's Bosworth describes this approach about 13 minutes into this
    podcast:

    http://www.itconversations.com/shows/detail571.html



  • Bob Carpenter at Jun 13, 2006 at 7:57 pm
    Very nice idea. This is the basis of most of the work on word-sense
    disambiguation (e.g. is it "run" as in baseball, "run" as in stock, or
    "run" as in stocking? Is "John Smith" the CEO of GM or the "John Smith"
    who loved Pocahontas?). TF/IDF is not a bad way to compute this, either,
    though there are definitely better classifiers for the purpose. The
    approach to context described above winds up looking like pseudo query
    refinement.

    Bosworth is a bit disingenuous in the podcast about just how stupid the
    techniques are that are being driven by Google's Patton-like Ph.D.s in
    tanks (his metaphor, not mine; all the Google employees I know are
    pacifists :-)). He also seems to have a bit of the "only Google can do
    this" bug, when in fact Yahoo and Amazon both have very respectable
    web-wide spelling correction. Here's what a couple of Ph.D.s from
    Microsoft have to say about the problem of using query logs for search:

    http://acl.ldc.upenn.edu/acl2004/emnlp/pdf/Cucerzan.pdf

    Check out what Google itself says publicly about how their spell checker
    works:

    "... When we calculate a greater number of relevant search results with
    an alternative spelling, you'll see "Did you mean: (more common
    spelling)" at the top of your search results page."
    - http://www.google.com/support/bin/answer.py?answer=1723

    The key here is the "greater number of relevant search results with an
    alt spelling". Yahoo does the same thing, and you can even
    reverse-engineer the numbers, as in how many times greater the number of
    results with an alt spelling has to be (the answer is about 256-2000
    times more likely, depending on the likelihood of the substitution
    [edits are more expensive early in words; vowels are cheaper to
    substitute for each other than anything else; etc.]). This is essentially
    the basis of all the modern spell checkers, whether they use logs or not.
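
    A toy sketch of that hit-count-ratio test, using docFreq as a stand-in
    for "number of relevant search results" (the field name, threshold, and
    the docFreq proxy are illustrative assumptions, not anyone's actual
    implementation):

        import org.apache.lucene.index.IndexReader;
        import org.apache.lucene.index.Term;

        public class DidYouMean {
            // Suggest the alternative spelling only when it matches far
            // more documents than the original does.
            static String suggest(IndexReader reader, String field,
                                  String original, String alternative,
                                  double threshold) throws Exception {
                int origHits = reader.docFreq(new Term(field, original));
                int altHits = reader.docFreq(new Term(field, alternative));
                return (altHits > origHits * threshold) ? alternative : null;
            }
            // e.g. suggest(reader, "body", "lacene", "lucene", 256.0)
        }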

    If it were just mining searcher behavior patterns literally, you wouldn't
    expect them to be able to correct all the variants of Lucene, as in:

    Apache lacene -> Apache Lucen{a,b,c,...,z}

    because, as popular as Google and Lucene are, there aren't that many
    typos in Google's logs - especially typos from all over the keyboard and
    at every point in the word.

    So there's really a more sophisticated notion of edit distance going on
    here, with query logs being used as a feature (in the machine learning
    sense) to guide the process. You need a lot of logs for this, and you
    need to be able to put sessions back together from IP addresses to mine
    the data. That's a piece of cake, so it's common practice.
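
    A sketch of the kind of weighted edit distance this implies, with
    made-up costs for illustration (vowel-for-vowel substitutions are cheap,
    edits near the start of a word cost extra):

        public class WeightedEditDistance {

            static double distance(String a, String b) {
                int n = a.length(), m = b.length();
                double[][] d = new double[n + 1][m + 1];
                for (int i = 1; i <= n; i++)
                    d[i][0] = d[i - 1][0] + positionFactor(i);
                for (int j = 1; j <= m; j++)
                    d[0][j] = d[0][j - 1] + positionFactor(j);
                for (int i = 1; i <= n; i++) {
                    for (int j = 1; j <= m; j++) {
                        double sub = d[i - 1][j - 1]
                            + (a.charAt(i - 1) == b.charAt(j - 1) ? 0.0
                               : subCost(a.charAt(i - 1), b.charAt(j - 1))
                                 * positionFactor(i));
                        double del = d[i - 1][j] + positionFactor(i);
                        double ins = d[i][j - 1] + positionFactor(j);
                        d[i][j] = Math.min(sub, Math.min(del, ins));
                    }
                }
                return d[n][m];
            }

            static boolean isVowel(char c) {
                return "aeiou".indexOf(c) >= 0;
            }

            // Vowel-for-vowel substitutions cost half as much.
            static double subCost(char x, char y) {
                return (isVowel(x) && isVowel(y)) ? 0.5 : 1.0;
            }

            // Edits in the first two positions cost 50% more.
            static double positionFactor(int pos) {
                return pos <= 2 ? 1.5 : 1.0;
            }
        }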

    Now try Google search without the "Apache" in front, and you'll see they use
    context:

    Apache Lucenx -> apache lucene
    akache lucne -> apache lucene
    Lucenx -> Lucent

    Even if they did only use the logs, it still requires some pretty fancy
    algorithm work to extract matches of variable length from huge amounts
    of log material. And a lot of careful tuning that's hard to get right
    if you never leave your tank.

    Note that they'll also split and recombine, so it's not just
    token-for-token:

    apachelucene -> apache lucene
    apache luc ene -> apache lucene

    You can also see their token sensitivity in

    apache luc(ene -> apache lucene
    apache lucene( -> apache lucene(

    You can also see how they normalize inside:

    apache luc()()()()ene -> apache lucene

    This is a very hard problem to tackle multi-lingually, which is why
    Google et al. have a lot of problems with false positives (correcting
    things that were right) and false negatives (missing corrections). This
    is especially obvious once you drop into a specialized domain that's not
    computer science (which is proportionally over-represented on the web),
    or a language that's much less popular on the web than English. For
    example:

    s'ils vous plaid -> no correction (846 hits)
    s'ils vous plait -> no correction (1.3M hits)

    - Bob Carpenter

