Grokbase Groups Lucene dev July 2006

[jira] Created: (LUCENE-626) Adaptive, user query session analyzing spell checker.

Karl Wettin (JIRA)
Jul 13, 2006 at 9:22 am
Adaptive, user query session analyzing spell checker.
-----------------------------------------------------

Key: LUCENE-626
URL: http://issues.apache.org/jira/browse/LUCENE-626
Project: Lucene - Java
Type: New Feature

Components: Search
Reporter: Karl Wettin
Priority: Minor
Attachments: spellcheck_0.0.1.tar.gz
From javadocs:
This is an adaptive, user query session analyzing spell checker. In plain words, a word and phrase dictionary that will learn from how users act while searching.

Be aware, this is a beta version. It is not finished, but yields great results if you have enough user activity, RAM and a fairly narrow document corpus. The RAM problem can be fixed if you implement your own subclass of SpellChecker, as the abstract methods of this class are the CRUD methods. This will most probably change to a strategy class in a future version.

TODO:

1. Gram up results to detect composite words that should not be composite words, and vice versa.

2. Train a grammed token (Markov) chain with output from an expectation-maximization algorithm (Weka clusters?) in parallel with a closest-path search (A* or breadth-first?) to allow contextual suggestions for queries that have never been placed.

Usage:

Training

At user query time, create an instance of QueryResults containing the query string, the number of hits and a time stamp. Add it to a chronologically ordered list in the user session (LinkedList makes sense) that you pass on to train(sessionQueries) when the session times out.

You also want to call the bootstrap() method every 100000 queries or so.
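
The training flow above might be wired up roughly as follows. This is a minimal sketch, not the code in the attached tarball: only the names QueryResults and train(sessionQueries) come from the description, while the field layout, the SessionTrainer interface and the SessionLog class are assumptions made for illustration.

```java
import java.util.LinkedList;
import java.util.List;

// Hypothetical stand-in for the QueryResults class named in the description.
final class QueryResults {
    final String query;
    final int hits;
    final long timestamp;

    QueryResults(String query, int hits, long timestamp) {
        this.query = query;
        this.hits = hits;
        this.timestamp = timestamp;
    }
}

// Hypothetical callback standing in for the spell checker's train(sessionQueries).
interface SessionTrainer {
    void train(List<QueryResults> sessionQueries);
}

// Collects the queries of one user session in chronological order.
final class SessionLog {
    private final LinkedList<QueryResults> queries = new LinkedList<>();

    // Call at query time, once the hit count is known.
    void record(String query, int hits, long timestamp) {
        queries.addLast(new QueryResults(query, hits, timestamp));
    }

    // Call when the session times out: hands a copy of the ordered list
    // to the trainer and resets the log.
    void onTimeout(SessionTrainer trainer) {
        trainer.train(new LinkedList<>(queries));
        queries.clear();
    }

    int size() {
        return queries.size();
    }
}
```

Handing over a copy keeps the trainer free to hold on to the list after the session log has been cleared.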

Spell checking

Call getSuggestions(query) and look at the results. Don't modify them! This method call will be hidden behind a facade in a future version.

Note that the spell checker is case sensitive, so you want to clean up the query the same way when you train as when you request suggestions.

I recommend something like query = query.toLowerCase().replaceAll("\\s+", " ").trim()
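
The clean-up recommended above could live in one helper so that training and lookup are guaranteed to share it. A small sketch: the class and method names are made up, and using `\\s+` to collapse any run of whitespace is an assumption about the intent of the recommendation.

```java
import java.util.Locale;

// Hypothetical helper; not part of the attached code.
final class QueryNormalizer {
    private QueryNormalizer() {}

    // Lower-cases the query, collapses each run of whitespace to a single
    // space, and trims leading and trailing whitespace.
    static String normalize(String query) {
        return query.toLowerCase(Locale.ROOT).replaceAll("\\s+", " ").trim();
    }
}
```

Call the same method on every query before it goes into a QueryResults instance and before it is passed to getSuggestions(query).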

--
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
http://www.atlassian.com/software/jira


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

44 responses

  • Karl Wettin (JIRA) at Jul 13, 2006 at 9:22 am
    [ http://issues.apache.org/jira/browse/LUCENE-626?page=all ]

    Karl Wettin updated LUCENE-626:
    -------------------------------

    Attachment: spellcheck_0.0.1.tar.gz

    Beta 1.

    Will add Prevayler transactions in the future.
  • Karl wettin at Jul 14, 2006 at 9:56 am

    On Thu, 2006-07-13 at 09:20 +0000, Karl Wettin (JIRA) wrote:
    Adaptive, user query session analyzing spell checker.
    I have a database with 3 million+ real user queries (session id,
    timestamp, query and hits) if anyone is interested in fooling around
    with the code. And if there is an interest, I might just manage to
    convince the owners to contribute the data to Apache.


  • Karl Wettin (JIRA) at Jul 26, 2006 at 1:04 am
    [ http://issues.apache.org/jira/browse/LUCENE-626?page=all ]

    Karl Wettin updated LUCENE-626:
    -------------------------------

    Attachment: spellcheck_20060725.tar.gz

    Bugfixes in bootstrap() and correction sequence extraction.

    A couple of optimizations.

    Negative training (didNotMean), but no automatic detection yet. I'm evaluating a couple of solutions. So perhaps next time(tm).
  • Karl Wettin (JIRA) at Aug 4, 2006 at 6:03 am
    [ http://issues.apache.org/jira/browse/LUCENE-626?page=all ]

    Karl Wettin updated LUCENE-626:
    -------------------------------

    Attachment: spellcheck_20060804.tar.gz

    beta 3

    total rewrite with focus on adaptation.

    session search sequence extraction, training and suggesting are now separate classes passed to the spell checker.

    still requires lots of user interaction to build a sufficient dictionary.

    has no optimization. bootstrap has been removed and will probably re-appear in a future default suggestion scheme instead. should be fast enough.

    now also comes with some junit test cases.

    default implementations are quite simple, but effective: they strip punctuation and whitespace from the suggestive data (trained suggestive and test phrases) in order to find incorrect composite and decomposed words, e.g. "the davinci code" --> "the da vinci code", "a clock work orange" --> "a clockwork orange".
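
    The stripping idea described above can be illustrated with a lookup keyed on the phrase minus punctuation and whitespace. This is only a sketch of the technique; the CompositeIndex class and its methods are hypothetical and are not taken from the attached tarball.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical illustration of matching composite and decomposed spellings.
final class CompositeIndex {
    private final Map<String, String> byKey = new HashMap<>();

    // Removes punctuation and whitespace so that "davinci" and "da vinci"
    // end up with the same key.
    static String key(String phrase) {
        return phrase.replaceAll("[\\p{Punct}\\s]+", "");
    }

    // Registers a phrase that training considers good.
    void train(String goodPhrase) {
        byKey.put(key(goodPhrase), goodPhrase);
    }

    // Returns the trained phrase sharing the query's key, or null when
    // nothing is known or the query already equals the trained phrase.
    String suggest(String query) {
        String known = byKey.get(key(query));
        return (known == null || known.equals(query)) ? null : known;
    }
}
```

    With "the da vinci code" and "a clockwork orange" trained, the two example queries above map to the trained spellings.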


    beta 4 will focus on training and suggestion classes that work on a secondary trie populated with known good data extracted from the corpus, navigated with edit distance. perhaps a forest-type trie to allow any starting point in a phrase.

    OR

    beta 4 will focus on discriminating trained queries to build clusters and suggest (facet) classes parallel to a plain text suggestion. that would be a major RAM consumer and require lots of manual tweaking per implementation, but a cool enough feature.

    time will tell.
  • Karl Wettin (JIRA) at Aug 4, 2006 at 9:18 pm
    [ http://issues.apache.org/jira/browse/LUCENE-626?page=all ]

    Karl Wettin updated LUCENE-626:
    -------------------------------

    Attachment: spellcheck_20060804_2.tar.gz

    oops, I attached the old code last time.
  • Karl Wettin (JIRA) at Jan 30, 2007 at 1:01 am
    [ https://issues.apache.org/jira/browse/LUCENE-626?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

    Karl Wettin updated LUCENE-626:
    -------------------------------

    Attachment: (was: spellcheck_0.0.1.tar.gz)
  • Karl Wettin (JIRA) at Jan 30, 2007 at 1:01 am
    [ https://issues.apache.org/jira/browse/LUCENE-626?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

    Karl Wettin updated LUCENE-626:
    -------------------------------

    Attachment: (was: spellcheck_20060725.tar.gz)
  • Karl Wettin (JIRA) at Jan 30, 2007 at 1:01 am
    [ https://issues.apache.org/jira/browse/LUCENE-626?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

    Karl Wettin updated LUCENE-626:
    -------------------------------

    Attachment: (was: spellcheck_20060804.tar.gz)
  • Karl Wettin (JIRA) at Jan 30, 2007 at 12:26 pm
    [ https://issues.apache.org/jira/browse/LUCENE-626?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

    Karl Wettin updated LUCENE-626:
    -------------------------------

    Attachment: (was: spellcheck_20060804_2.tar.gz)
  • Karl Wettin (JIRA) at Jan 30, 2007 at 12:26 pm
    [ https://issues.apache.org/jira/browse/LUCENE-626?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

    Karl Wettin updated LUCENE-626:
    -------------------------------

    Attachment: spellchecker.diff

    It uses the ngram spell checker for queries not yet corrected by users, but it handles more than one word at a time, and it inspects the term position vector if available. This way it can also rearrange the input into the most probable order.

    addDocument(indexWriter, field, "heroes of might and magic III complete");
    addDocument(indexWriter, field, "it might be the best game ever made");

    assertEquals("heroes of might and magic", suggester.didYouMean("hereos of magic and might"));
    assertEquals("heroes of might and magic", suggester.didYouMean("hereos of light and magic"));
    assertEquals("heroes might magic", suggester.didYouMean("magic light heros"));
    assertEquals("best game made", suggester.didYouMean("game best made"));
    assertEquals("game made", suggester.didYouMean("made game"));
    assertEquals("game made", suggester.didYouMean("made lame"));
    assertEquals(null, suggester.didYouMean("may game"));

    Once someone clicks on a suggestion (you have to report this back to the suggester) it will get a higher priority. If the person reports interest in one or more of the results of the suggested query that followed, it will get an even higher priority. If something is suggested but not clicked on, then the priority will go down. When the priority reaches a lower threshold, it will no longer be suggested, and the next best suggestion will appear. And so on.
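
    The feedback loop described above could be modelled along these lines. A minimal sketch under stated assumptions: the RankedSuggestions class, its method names and every numeric constant are invented for illustration and do not come from spellchecker.diff.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical model of click feedback on the suggestions for one query.
final class RankedSuggestions {
    // Illustrative values only; the patch may use entirely different ones.
    static final double CLICK_BONUS = 1.0;
    static final double IGNORE_PENALTY = 0.5;
    static final double THRESHOLD = 0.0;

    private static final class Entry {
        final String text;
        double priority;

        Entry(String text, double priority) {
            this.text = text;
            this.priority = priority;
        }
    }

    private final List<Entry> entries = new ArrayList<>();

    void add(String text, double initialPriority) {
        entries.add(new Entry(text, initialPriority));
    }

    // Highest-priority suggestion still above the threshold, or null.
    String best() {
        Entry top = null;
        for (Entry e : entries) {
            if (e.priority > THRESHOLD && (top == null || e.priority > top.priority)) {
                top = e;
            }
        }
        return top == null ? null : top.text;
    }

    // The user clicked the suggestion: raise its priority.
    void clicked(String text) {
        adjust(text, CLICK_BONUS);
    }

    // The suggestion was shown but not clicked: lower its priority.
    void ignored(String text) {
        adjust(text, -IGNORE_PENALTY);
    }

    private void adjust(String text, double delta) {
        for (Entry e : entries) {
            if (e.text.equals(text)) {
                e.priority += delta;
            }
        }
    }
}
```

    A suggestion that keeps being ignored drops to the threshold and the next best one takes its place; a later click brings it back.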

    Changing the query manually is the same thing as clicking on a suggestion, given it is similar enough and within a certain time frame.

    assertEquals("homm", suggester.didYouMean("heroes of might and magic"));
    assertEquals("heroes of might and magic", suggester.didYouMean("heroes of night and magic"));

    assertEquals("homm", suggester.didYouMean("heroes of might and magic"));
    assertEquals("heroes of might and magic", suggester.didYouMean("homm"));

    The data is stored in a Map<String /*query*/, List<Suggestion>>, and the default implementation strips \p{Punct} from the query. That should help with composite and decomposed words, among other things.

    assertEquals("the da vinci code", suggester.didYouMean("thedavincicode"));
    assertEquals("the da vinci code", suggester.didYouMean("the dav-inci code"));
    assertEquals("heroes of might and magic", suggester.didYouMean("heroes ofnight andmagic"));

    It seems the ngram spell checker tests are broken: they require the removed class English. I've re-introduced it in LUCENE-550.

    I will not work further on this patch and issue. It will be added to LUCENE-550 for caching and such.
  • Karl wettin at Jan 30, 2007 at 11:38 pm
    On 30 Jan 2007, at 13:25, Karl Wettin (JIRA) wrote:
    [ https://issues.apache.org/jira/browse/LUCENE-626

    Karl Wettin updated LUCENE-626:
    -------------------------------

    Attachment: spellchecker.diff
    I've been running this version live today. Suggests great stuff all
    the time. It is however a bit RAM hogging just as everything else I
    do. Think I'll add some sort of external persistency to handle that
    (probably BDB), backed by a soft referenced cache.
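One way to picture a soft-referenced cache in front of an external store is the sketch below. It is an illustration only: `SoftCache` is a hypothetical class, and the `loader` function stands in for whatever lookup the persistent store (BDB or otherwise) would provide. Values held only through a `SoftReference` may be reclaimed by the garbage collector under memory pressure, after which they are simply reloaded.

```java
import java.lang.ref.SoftReference;
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

public class SoftCache<K, V> {

    private final Map<K, SoftReference<V>> cache = new HashMap<>();
    private final Function<K, V> loader; // stands in for a read from the external store

    public SoftCache(Function<K, V> loader) {
        this.loader = loader;
    }

    /** Returns the cached value, reloading it if it was never cached or has been collected. */
    public synchronized V get(K key) {
        SoftReference<V> ref = cache.get(key);
        V value = (ref == null) ? null : ref.get();
        if (value == null) {
            value = loader.apply(key);
            if (value != null) {
                cache.put(key, new SoftReference<>(value));
            }
        }
        return value;
    }
}
```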

    There is a problem with the adaptive layer not adapting to (correct)
    suggestions with large edit distance supplied by the multi word/term
    position vector layer on top of the ngram spell checker. E.g. "magic
    might heros" -> "heroes might magic".

  • Karl Wettin (JIRA) at Jan 30, 2007 at 11:40 pm
    [ https://issues.apache.org/jira/browse/LUCENE-626?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12468825 ]

    Karl Wettin commented on LUCENE-626:
    ------------------------------------




    I've been running this version live today. Suggests great stuff all
    the time. It is however a bit RAM hogging just as everything else I
    do. Think I'll add some sort of external persistency to handle that
    (probably BDB), backed by a soft referenced cache.

    There is a problem with the adaptive layer not adapting to (correct)
    suggestions with large edit distance supplied by the multi word/term
    position vector layer on top of the ngram spell checker. E.g. "magic
    might heros" -> "heroes might magic".

    Adaptive, user query session analyzing spell checker.
    -----------------------------------------------------

    Key: LUCENE-626
    URL: https://issues.apache.org/jira/browse/LUCENE-626
    Project: Lucene - Java
    Issue Type: New Feature
    Components: Search
    Reporter: Karl Wettin
    Priority: Minor
    Attachments: spellchecker.diff


  • Karl Wettin (JIRA) at Feb 3, 2007 at 1:47 am
    [ https://issues.apache.org/jira/browse/LUCENE-626?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

    Karl Wettin updated LUCENE-626:
    -------------------------------

    Comment: was deleted
  • Karl Wettin (JIRA) at Feb 3, 2007 at 1:47 am
    [ https://issues.apache.org/jira/browse/LUCENE-626?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

    Karl Wettin updated LUCENE-626:
    -------------------------------

    Comment: was deleted
  • Karl Wettin (JIRA) at Feb 3, 2007 at 1:47 am
    [ https://issues.apache.org/jira/browse/LUCENE-626?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

    Karl Wettin updated LUCENE-626:
    -------------------------------

    Comment: was deleted
  • Karl Wettin (JIRA) at Feb 3, 2007 at 1:47 am
    [ https://issues.apache.org/jira/browse/LUCENE-626?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

    Karl Wettin updated LUCENE-626:
    -------------------------------

    Comment: was deleted
  • Karl Wettin (JIRA) at Feb 3, 2007 at 1:47 am
    [ https://issues.apache.org/jira/browse/LUCENE-626?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

    Karl Wettin updated LUCENE-626:
    -------------------------------

    Comment: was deleted
  • Karl Wettin (JIRA) at Feb 3, 2007 at 1:47 am
    [ https://issues.apache.org/jira/browse/LUCENE-626?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

    Karl Wettin updated LUCENE-626:
    -------------------------------

    Comment: was deleted
  • Karl Wettin (JIRA) at Feb 3, 2007 at 2:19 am
    [ https://issues.apache.org/jira/browse/LUCENE-626?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

    Karl Wettin updated LUCENE-626:
    -------------------------------

    Description:
    Some minor changes to how the single token ngram spell checker in contrib/spellcheck works, but nothing that breaks any old implementation, I think. Also fixed the broken test.

    NgramPhraseSuggester tokenizes a query and suggests combinations of the single token suggestions matrix.

    They must match as a query against an a priori index. By using a span near query (default) you get features like this:

    assertEquals("lost in translation", ngramSuggester.didYouMean("lost on translation"));

    If term position vectors are available it is possible to make it context sensitive (or whatever one may call it) to suggest a new term order.

    assertEquals("heroes might magic", ngramSuggester.didYouMean("magic light heros"));
    assertEquals("heroes of might and magic", ngramSuggester.didYouMean("heros on light and magik"));
    assertEquals("best game made", ngramSuggester.didYouMean("game best made"));
    assertEquals("game made", ngramSuggester.didYouMean("made game"));
    assertEquals("game made", ngramSuggester.didYouMean("made lame"));
    assertEquals("the game", ngramSuggester.didYouMean("the game"));
    assertEquals("in the fame", ngramSuggester.didYouMean("in the game"));
    assertEquals("game", ngramSuggester.didYouMean("same"));
    assertEquals(0, ngramSuggester.suggest("may game").size());
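The "combinations of the single token suggestions matrix" idea can be sketched as a Cartesian product over per-token candidate lists, followed by a filter. Everything here is an assumption made for illustration: `PhraseCombinations` is a hypothetical class, and the `matchesIndex` predicate stands in for the query against the a priori index (a span near query in the patch); no Lucene API is used.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;

public class PhraseCombinations {

    /** Every phrase that picks one candidate per token position, in matrix order. */
    public static List<String> combine(List<List<String>> matrix) {
        List<String> phrases = new ArrayList<>();
        if (matrix.isEmpty()) {
            return phrases;
        }
        phrases.add("");
        for (List<String> column : matrix) {
            List<String> next = new ArrayList<>();
            for (String prefix : phrases) {
                for (String candidate : column) {
                    next.add(prefix.isEmpty() ? candidate : prefix + " " + candidate);
                }
            }
            phrases = next; // an empty column leaves no phrases at all
        }
        return phrases;
    }

    /** Keeps only the phrases accepted by the check that stands in for the index query. */
    public static List<String> suggest(List<List<String>> matrix, Predicate<String> matchesIndex) {
        List<String> kept = new ArrayList<>();
        for (String phrase : combine(matrix)) {
            if (matchesIndex.test(phrase)) {
                kept.add(phrase);
            }
        }
        return kept;
    }
}
```

The number of combinations grows multiplicatively with the candidates per token, so a real implementation would presumably cap the candidates per position or prune early.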

    SessionAnalyzedDictionary is the adaptive layer that learns from how users changed their queries, what data they inspected, etc. It will automagically find and suggest synonyms, decomposed words, and probably a lot of other neat features I still have not detected.

    Depending a bit on the situation, ignored suggestions get suppressed and followed suggestions get suggested even more.

    assertEquals("the da vinci code", dictionary.didYouMean("thedavincicode"));
    assertEquals("the da vinci code", dictionary.didYouMean("the davinci code"));

    assertEquals("homm", dictionary.didYouMean("heroes of might and magic"));
    assertEquals("heroes of might and magic", dictionary.didYouMean("homm"));

    assertEquals("heroes of might and magic 2", dictionary.didYouMean("heroes of might and magic ii"));
    assertEquals("heroes of might and magic ii", dictionary.didYouMean("heroes of might and magic 2"));
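The suppress-and-reinforce behaviour described above can be pictured as a score table that user feedback nudges up or down. This is a toy sketch and not the patch's algorithm: `SuggestionFeedback`, its method names other than `didYouMean`, and the weights (+1.0 for a followed suggestion, -0.5 for an ignored one) are all assumptions made for this example.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

public class SuggestionFeedback {

    // query -> (suggestion -> accumulated score)
    private final Map<String, Map<String, Double>> scores = new HashMap<>();

    /** The user accepted the suggestion: reinforce it. */
    public void followed(String query, String suggestion) {
        adjust(query, suggestion, 1.0);
    }

    /** The user was shown the suggestion but did not take it: suppress it a little. */
    public void ignored(String query, String suggestion) {
        adjust(query, suggestion, -0.5);
    }

    private void adjust(String query, String suggestion, double delta) {
        scores.computeIfAbsent(query, q -> new HashMap<>()).merge(suggestion, delta, Double::sum);
    }

    /** Returns the highest-scoring suggestion with a positive score, or null if there is none. */
    public String didYouMean(String query) {
        String best = null;
        double bestScore = 0.0;
        Map<String, Double> candidates = scores.getOrDefault(query, Collections.<String, Double>emptyMap());
        for (Map.Entry<String, Double> e : candidates.entrySet()) {
            if (e.getValue() > bestScore) {
                best = e.getKey();
                bestScore = e.getValue();
            }
        }
        return best;
    }
}
```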


    The adaptive layer is not yet(tm) persistent, but soft referenced so that the dictionary doesn't eat up all your RAM.


    Lucene Fields: [Patch Available]
    Assignee: Karl Wettin
    Issue Type: Improvement (was: New Feature)
    Summary: Extended spell checker with phrase support and adaptive user session analysis. (was: Adaptive, user query session analyzing spell checker.)

    All of the old comments were obsolete, so I re-initialized the whole issue.
  • Karl Wettin (JIRA) at Feb 3, 2007 at 2:23 am
    [ https://issues.apache.org/jira/browse/LUCENE-626?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

    Karl Wettin updated LUCENE-626:
    -------------------------------

    Attachment: (was: spellchecker.diff)
  • Karl Wettin (JIRA) at Feb 3, 2007 at 2:23 am
    [ https://issues.apache.org/jira/browse/LUCENE-626?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

    Karl Wettin updated LUCENE-626:
    -------------------------------

    Attachment: spellchecker.diff

    NgramPhraseSuggester is now decoupled from the adaptive layer, but I would like to refactor it even more so it is easy to replace the SpellChecker with any other single token suggester.
    Extended spell checker with phrase support and adaptive user session analysis.
    ------------------------------------------------------------------------------

    Key: LUCENE-626
    URL: https://issues.apache.org/jira/browse/LUCENE-626
    Project: Lucene - Java
    Issue Type: Improvement
    Components: Search
    Reporter: Karl Wettin
    Assigned To: Karl Wettin
    Priority: Minor
    Attachments: spellchecker.diff


    Some minor changes to how the single token ngram spell checker in contrib/spellcheck works, but nothing that breaks any old implementation, I think. Also fixed the broken test.
    NgramPhraseSuggester tokenizes a query and suggests combinations from the matrix of single token suggestions.
    The combinations must match as a query against an a priori index. By using a span near query (default) you get features like this:
    assertEquals("lost in translation", ngramSuggester.didYouMean("lost on translation"));
    If term position vectors are available, the suggester can be made context sensitive (or whatever one may call it) and suggest a new term order.
    assertEquals("heroes might magic", ngramSuggester.didYouMean("magic light heros"));
    assertEquals("heroes of might and magic", ngramSuggester.didYouMean("heros on light and magik"));
    assertEquals("best game made", ngramSuggester.didYouMean("game best made"));
    assertEquals("game made", ngramSuggester.didYouMean("made game"));
    assertEquals("game made", ngramSuggester.didYouMean("made lame"));
    assertEquals("the game", ngramSuggester.didYouMean("the game"));
    assertEquals("in the fame", ngramSuggester.didYouMean("in the game"));
    assertEquals("game", ngramSuggester.didYouMean("same"));
    assertEquals(0, ngramSuggester.suggest("may game").size());
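A rough sketch of the "combinations of the single token suggestions matrix" idea, for readers who want something concrete. Everything here is an assumption made for illustration: the class and method names are hypothetical, and the `matchesIndex` predicate merely stands in for the span near query against the a priori index. It is not the code in the patch.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;

/** Hypothetical sketch: expand a per-token suggestion matrix into candidate phrases. */
final class PhraseCombinationSketch {

    /** Returns every phrase that takes one candidate from each row, keeping row order. */
    static List<String> combinations(List<List<String>> matrix) {
        List<String> phrases = new ArrayList<>();
        if (matrix.isEmpty()) {
            return phrases; // no tokens, no phrases
        }
        phrases.add("");
        for (List<String> candidates : matrix) {
            List<String> next = new ArrayList<>();
            for (String prefix : phrases) {
                for (String candidate : candidates) {
                    next.add(prefix.isEmpty() ? candidate : prefix + " " + candidate);
                }
            }
            phrases = next; // an empty row leaves no phrases at all
        }
        return phrases;
    }

    /** Keeps only the phrases accepted by the stand-in index lookup. */
    static List<String> suggest(List<List<String>> matrix, Predicate<String> matchesIndex) {
        List<String> accepted = new ArrayList<>();
        for (String phrase : combinations(matrix)) {
            if (matchesIndex.test(phrase)) {
                accepted.add(phrase);
            }
        }
        return accepted;
    }
}
```

With rows `[lost, last]`, `[on, in]`, `[translation]` and an index that only knows "lost in translation", `suggest` keeps that single phrase out of the four combinations.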
    SessionAnalyzedDictionary is the adaptive layer that learns from how users changed their queries, what data they inspected, etc. It will automagically find and suggest synonyms, decomposed words, and probably a lot of other neat features I still have not detected.
    Depending a bit on the situation, ignored suggestions get suppressed and followed suggestions get suggested even more.
    assertEquals("the da vinci code", dictionary.didYouMean("thedavincicode"));
    assertEquals("the da vinci code", dictionary.didYouMean("the davinci code"));
    assertEquals("homm", dictionary.didYouMean("heroes of might and magic"));
    assertEquals("heroes of might and magic", dictionary.didYouMean("homm"));
    assertEquals("heroes of might and magic 2", dictionary.didYouMean("heroes of might and magic ii"));
    assertEquals("heroes of might and magic ii", dictionary.didYouMean("heroes of might and magic 2"));
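The suppress/reinforce behaviour can be pictured with a small score table. This is only a sketch under assumed rules (plus one for a followed suggestion, minus one for an ignored one, suggest only while the score is positive); the class name and the scoring are hypothetical and are not taken from SessionAnalyzedDictionary.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical sketch: suggestions gain score when followed and lose it when ignored. */
final class SuggestionFeedbackSketch {

    // query -> (suggestion -> accumulated score)
    private final Map<String, Map<String, Double>> scores = new HashMap<>();

    /** The user accepted the suggestion for this query. */
    void followed(String query, String suggestion) {
        adjust(query, suggestion, 1.0);
    }

    /** The user was shown the suggestion but did not take it. */
    void ignored(String query, String suggestion) {
        adjust(query, suggestion, -1.0);
    }

    private void adjust(String query, String suggestion, double delta) {
        scores.computeIfAbsent(query, q -> new HashMap<>())
              .merge(suggestion, delta, Double::sum);
    }

    /** Returns the highest scoring suggestion with a positive score, or null if there is none. */
    String didYouMean(String query) {
        Map<String, Double> candidates = scores.get(query);
        if (candidates == null) {
            return null;
        }
        String best = null;
        double bestScore = 0.0;
        for (Map.Entry<String, Double> entry : candidates.entrySet()) {
            if (entry.getValue() > bestScore) {
                best = entry.getKey();
                bestScore = entry.getValue();
            }
        }
        return best;
    }
}
```

A suggestion that keeps being ignored eventually drops to a non-positive score and stops being offered, while a followed one outranks its competitors.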
    The adaptive layer is not yet(tm) persistent, but it is soft referenced so that the dictionary doesn't eat up all your RAM.
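For anyone unfamiliar with soft references: the JVM may clear them when memory runs low, so a dictionary held this way shrinks instead of causing an OutOfMemoryError. Below is a minimal sketch using the standard `java.lang.ref.SoftReference`; the class name and the choice to soft-reference individual entries are assumptions, and the patch may well hold its data differently.

```java
import java.lang.ref.SoftReference;
import java.util.HashMap;
import java.util.Map;

/** Hypothetical sketch: a non-persistent dictionary whose values the GC may reclaim. */
final class SoftDictionarySketch {

    private final Map<String, SoftReference<String>> entries = new HashMap<>();

    void put(String query, String suggestion) {
        entries.put(query, new SoftReference<>(suggestion));
    }

    /** Returns the stored suggestion, or null if it is unknown or was reclaimed. */
    String get(String query) {
        SoftReference<String> reference = entries.get(query);
        if (reference == null) {
            return null;
        }
        String suggestion = reference.get();
        if (suggestion == null) {
            entries.remove(query); // referent was cleared by the garbage collector
        }
        return suggestion;
    }
}
```

The price is that learned entries can vanish under memory pressure, which is why callers must treat a null result as "not known (any more)".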
  • Karl Wettin (JIRA) at Feb 9, 2007 at 12:42 pm
    [ https://issues.apache.org/jira/browse/LUCENE-626?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12471679 ]

    Karl Wettin commented on LUCENE-626:
    ------------------------------------

    Almost all TODOs in the code (mostly refactoring to support alternative single token suggesters [as in the old spell checker]) have been fixed, but as the code depends on LUCENE-550 it is available in that issue.
  • Karl Wettin (JIRA) at Feb 17, 2007 at 7:56 am
    [ https://issues.apache.org/jira/browse/LUCENE-626?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

    Karl Wettin updated LUCENE-626:
    -------------------------------

    Comment: was deleted
  • Karl Wettin (JIRA) at Mar 3, 2007 at 8:00 pm
    [ https://issues.apache.org/jira/browse/LUCENE-626?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

    Karl Wettin updated LUCENE-626:
    -------------------------------

    Attachment: didyoumean.patch.bz2

    Patch depends on LUCENE-550.
  • Karl Wettin (JIRA) at Mar 3, 2007 at 8:08 pm
    [ https://issues.apache.org/jira/browse/LUCENE-626?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

    Karl Wettin updated LUCENE-626:
    -------------------------------

    Description:
    Extensive javadocs are available in the patch, but I try to keep a compiled copy here: http://ginandtonique.org/~kalle/javadocs/didyoumean/org/apache/lucene/search/didyoumean/package-summary.html#package_description

    The patch spellchecker.diff should not depend on anything but Lucene trunk. It has basic support for phrase suggestions and query goal detection, but is pretty buggy and lacks features available in didyoumean.patch.bz2. The latter depends on LUCENE-550.

    Example:
    {code:java}
    public void testImportData() throws Exception {

        // load 200 000 user queries with session data and time stamp. no goals specified.
        System.out.println("Processing http://ginandtonique.org/~kalle/data/pirate.data.gz");
        importFile(new InputStreamReader(new GZIPInputStream(new URL("http://ginandtonique.org/~kalle/data/pirate.data.gz").openStream())));
        System.out.println("Processing http://ginandtonique.org/~kalle/data/hero.data.gz");
        importFile(new InputStreamReader(new GZIPInputStream(new URL("http://ginandtonique.org/~kalle/data/hero.data.gz").openStream())));
        System.out.println("Done.");

        // run some tests without the second level suggestions,
        // i.e. user behavioral data only. no ngrams or so.
        assertEquals("pirates of the caribbean", facade.didYouMean("pirates ofthe caribbean"));
        assertEquals("pirates of the caribbean", facade.didYouMean("pirates of the carribbean"));

        assertEquals("pirates caribbean", facade.didYouMean("pirates carricean"));
        assertEquals("pirates of the caribbean", facade.didYouMean("pirates of the carriben"));
        assertEquals("pirates of the caribbean", facade.didYouMean("pirates of the carabien"));
        assertEquals("pirates of the caribbean", facade.didYouMean("pirates of the carabbean"));

        assertEquals("pirates of the caribbean", facade.didYouMean("pirates og carribean"));

        assertEquals("pirates of the caribbean soundtrack", facade.didYouMean("pirates of the caribbean music"));
        assertEquals("pirates of the caribbean score", facade.didYouMean("pirates of the caribbean soundtrack"));

        assertEquals("pirate of caribbean", facade.didYouMean("pirate of carabian"));
        assertEquals("pirates of caribbean", facade.didYouMean("pirate of caribbean"));
        assertEquals("pirates of caribbean", facade.didYouMean("pirates of caribbean"));

        // depending on how many hits and goals are noted with these two queries
        // perhaps the delta should be added to a synonym dictionary?
        assertEquals("homm iv", facade.didYouMean("homm 4"));

        // not yet known.. and we have no second level yet.
        assertNull(facade.didYouMean("the pilates"));

        // use the dictionary built from user queries to build the token phrase and ngram suggester.
        facade.getDictionary().getPrioritesBySecondLevelSuggester().put(Factory.ngramTokenPhraseSuggesterFactory(facade.getDictionary()), 1d);

        // now it's learned
        assertEquals("the pirates", facade.didYouMean("the pilates"));

        // typos
        assertEquals("heroes of might and magic", facade.didYouMean("heroes of fight and magic"));
        assertEquals("heroes of might and magic", facade.didYouMean("heroes of right and magic"));
        assertEquals("heroes of might and magic", facade.didYouMean("heroes of magic and light"));

        // composite dictionary key not learned yet..
        assertEquals(null, facade.didYouMean("heroesof lightand magik"));
        // learn
        assertEquals("heroes of might and magic", facade.didYouMean("heroes of light and magik"));
        // test
        assertEquals("heroes of might and magic", facade.didYouMean("heroesof lightand magik"));

        // wrong term order
        assertEquals("heroes of might and magic", facade.didYouMean("heroes of magic and might"));
    }
    {code}
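The test above suggests a two-level lookup: behavioural data first, then "second level" suggesters registered with a priority. A simplified sketch of that fallback follows. All names are hypothetical, the suggesters are plain functions, and the real facade evidently also learns from second-level answers (see the "learn" step in the test), which this sketch leaves out.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

/** Hypothetical sketch: learned dictionary first, then second-level suggesters by priority. */
final class TwoLevelSuggesterSketch {

    private final Map<String, String> learned = new HashMap<>();
    private final Map<Function<String, String>, Double> prioritiesBySecondLevel = new HashMap<>();

    /** Records a first-level (user behaviour) suggestion. */
    void learn(String query, String suggestion) {
        learned.put(query, suggestion);
    }

    /** Registers a fallback suggester; higher priority is consulted first. */
    void addSecondLevel(Function<String, String> suggester, double priority) {
        prioritiesBySecondLevel.put(suggester, priority);
    }

    /** Returns the learned suggestion, else the first non-null fallback answer, else null. */
    String didYouMean(String query) {
        String first = learned.get(query);
        if (first != null) {
            return first;
        }
        List<Map.Entry<Function<String, String>, Double>> ordered =
                new ArrayList<>(prioritiesBySecondLevel.entrySet());
        ordered.sort((a, b) -> Double.compare(b.getValue(), a.getValue())); // descending priority
        for (Map.Entry<Function<String, String>, Double> entry : ordered) {
            String suggestion = entry.getKey().apply(query);
            if (suggestion != null) {
                return suggestion;
            }
        }
        return null;
    }
}
```

Before any second-level suggester is registered, an unknown query such as "the pilates" yields null; once one is added, its answer is used only when the learned dictionary has nothing.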


  • Nicolas Lalevée (JIRA) at Mar 3, 2007 at 9:05 pm
    [ https://issues.apache.org/jira/browse/LUCENE-626?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12477685 ]

    Nicolas Lalevée commented on LUCENE-626:
    ----------------------------------------

    This feature looks interesting, but why should it depend on LUCENE-550 ?
  • Karl Wettin (JIRA) at Mar 3, 2007 at 9:15 pm
    [ https://issues.apache.org/jira/browse/LUCENE-626?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12477688 ]

    Karl Wettin commented on LUCENE-626:
    ------------------------------------

    Nicolas Lalevée [03/Mar/07 01:04 PM]
    This feature looks interesting, but why should it depend on LUCENE-550 ?
    It uses the Index (notification, unison index factory methods, etc.) and IndexFacade (cache, fresh reader/searcher, etc.) available in that patch. Doing so also lets me use InstantiatedIndex for the a priori corpus and ngram index to speed up the response time even more.
    --
    This message is automatically generated by JIRA.
    -
    You can reply to this email to add a comment to the issue online.


    ---------------------------------------------------------------------
    To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
    For additional commands, e-mail: java-dev-help@lucene.apache.org
  • Karl Wettin (JIRA) at Aug 17, 2007 at 12:11 am
    [ https://issues.apache.org/jira/browse/LUCENE-626?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

    Karl Wettin updated LUCENE-626:
    -------------------------------

    Attachment: LUCENE-626_20070817.patch

    As the phrase-suggestion layer on top of contrib/spell in this patch has been mentioned in a number of forums over the last few weeks, I've removed the LUCENE-550 dependency and brought the patch up to date with trunk.

    Second level suggesting (ngram token, phrase) can run standalone. See TestTokenPhraseSuggester. However, I recommend the adaptive dictionary, as it will act as a cache on top of the second level suggestions. (See docs.)

    Output from using the adaptive layer only, i.e. suggestions based on how users previously behaved. About half a million user queries were analyzed to build the dictionary (it takes 30 seconds to build on my dual core):

    3ms pirates ofthe caribbean -> pirates of the caribbean
    2ms pirates of the carribbean -> pirates of the caribbean
    0ms pirates carricean -> pirates caribbean
    1ms pirates of the carriben -> pirates of the caribbean
    0ms pirates of the carabien -> pirates of the caribbean
    0ms pirates of the carabbean -> pirates of the caribbean
    1ms pirates og carribean -> pirates of the caribbean
    0ms pirates of the caribbean music -> pirates of the caribbean soundtrack
    0ms pirates of the caribbean soundtrack -> pirates of the caribbean score
    0ms pirate of carabian -> pirate of caribbean
    0ms pirate of caribbean -> pirates of caribbean
    0ms pirates of caribbean -> pirates of caribbean
    0ms homm 4 -> homm iv
    0ms the pilates -> null


    Output from the phrase ngram token suggester, which checks token matrices against an apriori index. A lot of index queries are required for one suggestion. Using an InstantiatedIndex as the apriori index saves plenty of milliseconds. This is expensive stuff, but it works pretty well.

    72ms the pilates -> the pirates
    440ms heroes of fight and magic -> heroes of might and magic
    417ms heroes of right and magic -> heroes of might and magic
    383ms heroes of magic and light -> heroes of might and magic
    20ms heroesof lightand magik -> null
    385ms heroes of light and magik -> heroes of might and magic
    0ms heroesof lightand magik -> heroes of might and magic
    385ms heroes of magic and might -> heroes of might and magic

    (That 0ms is because the previous suggestion was cached. One does not have to use this cache.)
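
The cache-on-top behaviour described above can be illustrated with a minimal sketch. This is not code from the patch: the class `CachingSuggester`, its methods and the lambda standing in for the expensive second level suggester are all hypothetical, and plain memoization is assumed rather than the adaptive dictionary's actual learning logic.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

/** Hypothetical illustration only; not a class from the LUCENE-626 patch. */
final class CachingSuggester {
    // Suggestions that have already been computed once.
    private final Map<String, String> learned = new HashMap<>();
    // Stand-in for an expensive second level suggester (assumption).
    private final Function<String, String> secondLevel;
    private int secondLevelCalls = 0;

    CachingSuggester(Function<String, String> secondLevel) {
        this.secondLevel = secondLevel;
    }

    /** Returns a suggestion, or null when neither layer has one. */
    String didYouMean(String query) {
        String cached = learned.get(query);
        if (cached != null) {
            return cached; // cheap path: no second level work needed
        }
        secondLevelCalls++;
        String suggestion = secondLevel.apply(query);
        if (suggestion != null) {
            learned.put(query, suggestion); // remember for next time
        }
        return suggestion;
    }

    int secondLevelCalls() {
        return secondLevelCalls;
    }

    public static void main(String[] args) {
        CachingSuggester suggester = new CachingSuggester(
                q -> "the pilates".equals(q) ? "the pirates" : null);
        System.out.println(suggester.didYouMean("the pilates")); // computed by the second level
        System.out.println(suggester.didYouMean("the pilates")); // answered from the cache
        System.out.println(suggester.secondLevelCalls());
    }
}
```

Only non-null suggestions are stored here, so a query that had no suggestion is tried again on the next call. Whether the real dictionary behaves that way is not stated in this thread.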
    Extended spell checker with phrase support and adaptive user session analysis.
    ------------------------------------------------------------------------------

    Key: LUCENE-626
    URL: https://issues.apache.org/jira/browse/LUCENE-626
    Project: Lucene - Java
    Issue Type: Improvement
    Components: Search
    Reporter: Karl Wettin
    Assignee: Karl Wettin
    Priority: Minor
    Attachments: didyoumean.patch.bz2, LUCENE-626_20070817.patch, spellchecker.diff


  • Karl Wettin (JIRA) at Aug 17, 2007 at 2:08 am
    [ https://issues.apache.org/jira/browse/LUCENE-626?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12520435 ]

    Karl Wettin commented on LUCENE-626:
    ------------------------------------

    RAMDirectory vs. InstantiatedIndex as the apriori index: the latter is roughly 5 to 25 times faster (leaving out the first query).

    RAMDirectory:
    72ms the pilates -> the pirates
    440ms heroes of fight and magic -> heroes of might and magic
    417ms heroes of right and magic -> heroes of might and magic
    383ms heroes of magic and light -> heroes of might and magic
    20ms heroesof lightand magik -> null
    385ms heroes of light and magik -> heroes of might and magic
    0ms heroesof lightand magik -> heroes of might and magic
    385ms heroes of magic and might -> heroes of might and magic


    InstantiatedIndex:
    171ms the pilates -> the pirates
    66ms heroes of fight and magic -> heroes of might and magic
    36ms heroes of right and magic -> heroes of might and magic
    14ms heroes of magic and light -> heroes of might and magic
    6ms heroesof lightand magik -> null
    15ms heroes of light and magik -> heroes of might and magic
    0ms heroesof lightand magik -> heroes of might and magic
    16ms heroes of magic and might -> heroes of might and magic
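
As a quick check of the ratio between the two listings, the per-query speedups can be computed from the timings above. The sketch below is an illustration, not part of the patch; the class name `SpeedupTable` is hypothetical, and it leaves out the first query and the cached 0ms row.

```java
/** Hypothetical helper; timings are copied from the two listings above. */
final class SpeedupTable {
    // Milliseconds per query, in listing order, without the first query
    // ("the pilates") and without the cached 0ms row.
    static final int[] RAM_DIRECTORY = {440, 417, 383, 20, 385, 385};
    static final int[] INSTANTIATED_INDEX = {66, 36, 14, 6, 15, 16};

    /** Ratio RAMDirectory time / InstantiatedIndex time for each query. */
    static double[] speedups() {
        double[] ratios = new double[RAM_DIRECTORY.length];
        for (int i = 0; i < ratios.length; i++) {
            ratios[i] = (double) RAM_DIRECTORY[i] / INSTANTIATED_INDEX[i];
        }
        return ratios;
    }

    static double min(double[] values) {
        double m = values[0];
        for (double v : values) {
            m = Math.min(m, v);
        }
        return m;
    }

    static double max(double[] values) {
        double m = values[0];
        for (double v : values) {
            m = Math.max(m, v);
        }
        return m;
    }

    public static void main(String[] args) {
        double[] ratios = speedups();
        for (double r : ratios) {
            System.out.printf("%.1fx%n", r);
        }
        System.out.printf("min %.1fx, max %.1fx%n", min(ratios), max(ratios));
    }
}
```

With these numbers the ratios run from about 3.3x (the query that returns null) to about 27.4x, which is in line with the stated 5 to 25 times.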
  • Karl Wettin (JIRA) at Oct 16, 2007 at 6:46 pm
    [ https://issues.apache.org/jira/browse/LUCENE-626?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

    Karl Wettin updated LUCENE-626:
    -------------------------------

    Attachment: (was: LUCENE-626_20070817.patch)
    Extended spell checker with phrase support and adaptive user session analysis.
    ------------------------------------------------------------------------------

    Key: LUCENE-626
    URL: https://issues.apache.org/jira/browse/LUCENE-626
    Project: Lucene - Java
    Issue Type: Improvement
    Components: Search
    Reporter: Karl Wettin
    Assignee: Karl Wettin
    Priority: Minor
    Attachments: LUCENE-626_2007_10_16.txt


  • Karl Wettin (JIRA) at Oct 16, 2007 at 6:46 pm
    [ https://issues.apache.org/jira/browse/LUCENE-626?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

    Karl Wettin updated LUCENE-626:
    -------------------------------

    Attachment: (was: spellchecker.diff)
  • Karl Wettin (JIRA) at Oct 16, 2007 at 6:46 pm
    [ https://issues.apache.org/jira/browse/LUCENE-626?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

    Karl Wettin updated LUCENE-626:
    -------------------------------

    Attachment: LUCENE-626_2007_10_16.txt

    New in this patch:

    * Dictionary persistence using BDB/je/3.2.44
    * Lots of refactoring to make the code easier to follow

    Next patch will probably focus on:

    * Pruning of dictionary
    * Documentation of non-consumer methods
  • Karl Wettin (JIRA) at Oct 16, 2007 at 6:46 pm
    [ https://issues.apache.org/jira/browse/LUCENE-626?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

    Karl Wettin updated LUCENE-626:
    -------------------------------

    Attachment: (was: didyoumean.patch.bz2)
    Extended spell checker with phrase support and adaptive user session analysis.
    ------------------------------------------------------------------------------

    Key: LUCENE-626
    URL: https://issues.apache.org/jira/browse/LUCENE-626
    Project: Lucene - Java
    Issue Type: Improvement
    Components: Search
    Reporter: Karl Wettin
    Assignee: Karl Wettin
    Priority: Minor
    Attachments: LUCENE-626_2007_10_16.txt


    Extensive java docs available in patch, but I try to keep it compiled here: http://ginandtonique.org/~kalle/javadocs/didyoumean/org/apache/lucene/search/didyoumean/package-summary.html#package_description
    The patch spellcheck.diff should not depend on anything but Lucene trunk. It has basic support for phrase suggestions and query goal detection, but is pretty buggy and lacks features available in didyoumean.diff.bz2. The latter depends on LUCENE-550.
    Example:
    {code:java}
    public void testImportData() throws Exception {
    // load 200 000 user queries with session data and time stamp. no goals specified.
    System.out.println("Processing http://ginandtonique.org/~kalle/data/pirate.data.gz");
    importFile(new InputStreamReader(new GZIPInputStream(new URL("http://ginandtonique.org/~kalle/data/pirate.data.gz").openStream())));
    System.out.println("Processing http://ginandtonique.org/~kalle/data/hero.data.gz");
    importFile(new InputStreamReader(new GZIPInputStream(new URL("http://ginandtonique.org/~kalle/data/hero.data.gz").openStream())));
    System.out.println("Done.");
    // run some tests without the second level suggestions,
    // i.e. user behavioral data only. no ngrams or so.

    assertEquals("pirates of the caribbean", facade.didYouMean("pirates ofthe caribbean"));
    assertEquals("pirates of the caribbean", facade.didYouMean("pirates of the carribbean"));
    assertEquals("pirates caribbean", facade.didYouMean("pirates carricean"));
    assertEquals("pirates of the caribbean", facade.didYouMean("pirates of the carriben"));
    assertEquals("pirates of the caribbean", facade.didYouMean("pirates of the carabien"));
    assertEquals("pirates of the caribbean", facade.didYouMean("pirates of the carabbean"));
    assertEquals("pirates of the caribbean", facade.didYouMean("pirates og carribean"));
    assertEquals("pirates of the caribbean soundtrack", facade.didYouMean("pirates of the caribbean music"));
    assertEquals("pirates of the caribbean score", facade.didYouMean("pirates of the caribbean soundtrack"));
    assertEquals("pirate of caribbean", facade.didYouMean("pirate of carabian"));
    assertEquals("pirates of caribbean", facade.didYouMean("pirate of caribbean"));
    assertEquals("pirates of caribbean", facade.didYouMean("pirates of caribbean"));
    // depending on how many hits and goals are noted with these two queries
    // perhaps the delta should be added to a synonym dictionary?
    assertEquals("homm iv", facade.didYouMean("homm 4"));
    // not yet known.. and we have no second level yet.
    assertNull(facade.didYouMean("the pilates"));
    // use the dictionary built from user queries to build the token phrase and ngram suggester.
    facade.getDictionary().getPrioritesBySecondLevelSuggester().put(Factory.ngramTokenPhraseSuggesterFactory(facade.getDictionary()), 1d);
    // now it's learned
    assertEquals("the pirates", facade.didYouMean("the pilates"));
    // typos
    assertEquals("heroes of might and magic", facade.didYouMean("heroes of fight and magic"));
    assertEquals("heroes of might and magic", facade.didYouMean("heroes of right and magic"));
    assertEquals("heroes of might and magic", facade.didYouMean("heroes of magic and light"));
    // composite dictionary key not learned yet..
    assertEquals(null, facade.didYouMean("heroesof lightand magik"));
    // learn
    assertEquals("heroes of might and magic", facade.didYouMean("heroes of light and magik"));
    // test
    assertEquals("heroes of might and magic", facade.didYouMean("heroesof lightand magik"));
    // wrong term order
    assertEquals("heroes of might and magic", facade.didYouMean("heroes of magic and might"));
    }
    {code}
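
    The test above first relies on learned user behavior only, then registers an n-gram based second level suggester, after which "the pilates" resolves to "the pirates". A minimal sketch of that two-level idea follows. It assumes a word-level lookup and a character trigram Jaccard score; the class TwoLevelSuggestSketch, its method names and the 0.2 threshold are hypothetical and are not taken from the patch, which works on whole phrases with its own suggester classes.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Hypothetical sketch: learned suggestions first, character trigrams as a fallback. */
class TwoLevelSuggestSketch {

    /** Assumed cut-off; a candidate must score above this to be suggested. */
    private static final double MIN_SCORE = 0.2;

    /** First level: corrections learned from user behavior (query -> correction). */
    private final Map<String, String> learned = new HashMap<>();

    /** Second level: known-good words the trigram fallback may suggest. */
    private final List<String> vocabulary;

    TwoLevelSuggestSketch(List<String> vocabulary) {
        this.vocabulary = vocabulary;
    }

    void learn(String query, String correction) {
        learned.put(query, correction);
    }

    /** All character trigrams of a word; words shorter than three characters yield none. */
    static Set<String> trigrams(String word) {
        Set<String> grams = new HashSet<>();
        for (int i = 0; i + 3 <= word.length(); i++) {
            grams.add(word.substring(i, i + 3));
        }
        return grams;
    }

    /** Jaccard similarity: shared trigrams divided by all distinct trigrams. */
    static double jaccard(Set<String> a, Set<String> b) {
        Set<String> union = new HashSet<>(a);
        union.addAll(b);
        if (union.isEmpty()) {
            return 0d;
        }
        Set<String> common = new HashSet<>(a);
        common.retainAll(b);
        return (double) common.size() / union.size();
    }

    /** Returns a suggestion, or null if the word is known or nothing scores high enough. */
    String suggest(String word) {
        String known = learned.get(word);
        if (known != null) {
            return known; // first level wins
        }
        if (vocabulary.contains(word)) {
            return null; // already spelled like a known word
        }
        String best = null;
        double bestScore = MIN_SCORE;
        Set<String> grams = trigrams(word);
        for (String candidate : vocabulary) {
            double score = jaccard(grams, trigrams(candidate));
            if (score > bestScore) {
                best = candidate;
                bestScore = score;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        TwoLevelSuggestSketch sketch =
            new TwoLevelSuggestSketch(Arrays.asList("pirates", "pilots", "caribbean"));
        System.out.println(sketch.suggest("pilates")); // pirates
    }
}
```

    In this sketch "pilates" shares two of eight distinct trigrams with "pirates" (score 0.25) and only one of eight with "pilots", so "pirates" is the only candidate above the assumed threshold.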
    --
    This message is automatically generated by JIRA.
    -
    You can reply to this email to add a comment to the issue online.


    ---------------------------------------------------------------------
    To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
    For additional commands, e-mail: java-dev-help@lucene.apache.org
  • Karl Wettin (JIRA) at Oct 16, 2007 at 6:48 pm
    [ https://issues.apache.org/jira/browse/LUCENE-626?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

    Karl Wettin updated LUCENE-626:
    -------------------------------

    Description:
    Extensive java docs available in patch, but I try to keep it compiled here: http://ginandtonique.org/~kalle/javadocs/didyoumean/org/apache/lucene/search/didyoumean/package-summary.html#package_description

    Example:
    {code:java}
    public void testImportData() throws Exception {

    // load 200 000 user queries with session data and time stamp. no goals specified.

    System.out.println("Processing http://ginandtonique.org/~kalle/data/pirate.data.gz");
    importFile(new InputStreamReader(new GZIPInputStream(new URL("http://ginandtonique.org/~kalle/data/pirate.data.gz").openStream())));
    System.out.println("Processing http://ginandtonique.org/~kalle/data/hero.data.gz");
    importFile(new InputStreamReader(new GZIPInputStream(new URL("http://ginandtonique.org/~kalle/data/hero.data.gz").openStream())));
    System.out.println("Done.");

    // run some tests without the second level suggestions,
    // i.e. user behavioral data only. no ngrams or so.

    assertEquals("pirates of the caribbean", facade.didYouMean("pirates ofthe caribbean"));

    assertEquals("pirates of the caribbean", facade.didYouMean("pirates of the carribbean"));

    assertEquals("pirates caribbean", facade.didYouMean("pirates carricean"));
    assertEquals("pirates of the caribbean", facade.didYouMean("pirates of the carriben"));
    assertEquals("pirates of the caribbean", facade.didYouMean("pirates of the carabien"));
    assertEquals("pirates of the caribbean", facade.didYouMean("pirates of the carabbean"));

    assertEquals("pirates of the caribbean", facade.didYouMean("pirates og carribean"));

    assertEquals("pirates of the caribbean soundtrack", facade.didYouMean("pirates of the caribbean music"));
    assertEquals("pirates of the caribbean score", facade.didYouMean("pirates of the caribbean soundtrack"));

    assertEquals("pirate of caribbean", facade.didYouMean("pirate of carabian"));
    assertEquals("pirates of caribbean", facade.didYouMean("pirate of caribbean"));
    assertEquals("pirates of caribbean", facade.didYouMean("pirates of caribbean"));

    // depending on how many hits and goals are noted with these two queries
    // perhaps the delta should be added to a synonym dictionary?
    assertEquals("homm iv", facade.didYouMean("homm 4"));

    // not yet known.. and we have no second level yet.
    assertNull(facade.didYouMean("the pilates"));

    // use the dictionary built from user queries to build the token phrase and ngram suggester.
    facade.getDictionary().getPrioritesBySecondLevelSuggester().put(Factory.ngramTokenPhraseSuggesterFactory(facade.getDictionary()), 1d);

    // now it's learned
    assertEquals("the pirates", facade.didYouMean("the pilates"));

    // typos
    assertEquals("heroes of might and magic", facade.didYouMean("heroes of fight and magic"));
    assertEquals("heroes of might and magic", facade.didYouMean("heroes of right and magic"));
    assertEquals("heroes of might and magic", facade.didYouMean("heroes of magic and light"));

    // composite dictionary key not learned yet..
    assertEquals(null, facade.didYouMean("heroesof lightand magik"));
    // learn
    assertEquals("heroes of might and magic", facade.didYouMean("heroes of light and magik"));
    // test
    assertEquals("heroes of might and magic", facade.didYouMean("heroesof lightand magik"));


    // wrong term order
    assertEquals("heroes of might and magic", facade.didYouMean("heroes of magic and might"));

    }
    {code}


    was:
    Extensive java docs available in patch, but I try to keep it compiled here: http://ginandtonique.org/~kalle/javadocs/didyoumean/org/apache/lucene/search/didyoumean/package-summary.html#package_description

    The patch spellcheck.diff should not depend on anything but Lucene trunk. It has basic support for phrase suggestions and query goal detection, but is pretty buggy and lacks features available in didyoumean.diff.bz2. The latter depends on LUCENE-550.

    Example:
    {code:java}
    public void testImportData() throws Exception {

    // load 200 000 user queries with session data and time stamp. no goals specified.

    System.out.println("Processing http://ginandtonique.org/~kalle/data/pirate.data.gz");
    importFile(new InputStreamReader(new GZIPInputStream(new URL("http://ginandtonique.org/~kalle/data/pirate.data.gz").openStream())));
    System.out.println("Processing http://ginandtonique.org/~kalle/data/hero.data.gz");
    importFile(new InputStreamReader(new GZIPInputStream(new URL("http://ginandtonique.org/~kalle/data/hero.data.gz").openStream())));
    System.out.println("Done.");

    // run some tests without the second level suggestions,
    // i.e. user behavioral data only. no ngrams or so.

    assertEquals("pirates of the caribbean", facade.didYouMean("pirates ofthe caribbean"));

    assertEquals("pirates of the caribbean", facade.didYouMean("pirates of the carribbean"));

    assertEquals("pirates caribbean", facade.didYouMean("pirates carricean"));
    assertEquals("pirates of the caribbean", facade.didYouMean("pirates of the carriben"));
    assertEquals("pirates of the caribbean", facade.didYouMean("pirates of the carabien"));
    assertEquals("pirates of the caribbean", facade.didYouMean("pirates of the carabbean"));

    assertEquals("pirates of the caribbean", facade.didYouMean("pirates og carribean"));

    assertEquals("pirates of the caribbean soundtrack", facade.didYouMean("pirates of the caribbean music"));
    assertEquals("pirates of the caribbean score", facade.didYouMean("pirates of the caribbean soundtrack"));

    assertEquals("pirate of caribbean", facade.didYouMean("pirate of carabian"));
    assertEquals("pirates of caribbean", facade.didYouMean("pirate of caribbean"));
    assertEquals("pirates of caribbean", facade.didYouMean("pirates of caribbean"));

    // depending on how many hits and goals are noted with these two queries
    // perhaps the delta should be added to a synonym dictionary?
    assertEquals("homm iv", facade.didYouMean("homm 4"));

    // not yet known.. and we have no second level yet.
    assertNull(facade.didYouMean("the pilates"));

    // use the dictionary built from user queries to build the token phrase and ngram suggester.
    facade.getDictionary().getPrioritesBySecondLevelSuggester().put(Factory.ngramTokenPhraseSuggesterFactory(facade.getDictionary()), 1d);

    // now it's learned
    assertEquals("the pirates", facade.didYouMean("the pilates"));

    // typos
    assertEquals("heroes of might and magic", facade.didYouMean("heroes of fight and magic"));
    assertEquals("heroes of might and magic", facade.didYouMean("heroes of right and magic"));
    assertEquals("heroes of might and magic", facade.didYouMean("heroes of magic and light"));

    // composite dictionary key not learned yet..
    assertEquals(null, facade.didYouMean("heroesof lightand magik"));
    // learn
    assertEquals("heroes of might and magic", facade.didYouMean("heroes of light and magik"));
    // test
    assertEquals("heroes of might and magic", facade.didYouMean("heroesof lightand magik"));


    // wrong term order
    assertEquals("heroes of might and magic", facade.didYouMean("heroes of magic and might"));

    }
    {code}


  • Karl Wettin (JIRA) at Oct 23, 2007 at 8:19 pm
    [ https://issues.apache.org/jira/browse/LUCENE-626?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

    Karl Wettin updated LUCENE-626:
    -------------------------------

    Attachment: LUCENE-626_20071023.txt

    In this patch:

    * Updated package javadoc

    * Simplified consumer interface with persistent session management:

    {code:java}
    SuggestionFacade facade = new SuggestionFacade(new File("data"));
    facade.getDictionary().getPrioritesBySecondLevelSuggester().putAll(facade.secondLevelSuggestionFactory());
    ...
    QuerySession session = facade.getQuerySessionManager().sessionFactory();
    ...
    String query = "heros of mght and magik";
    Hits hits = searcher.search(queryFactory(query));
    String suggested = facade.didYouMean(query);
    session.query(query, hits.length(), suggested);
    ...
    facade.getQuerySessionManager().getSessionsByID().put(session);
    ...
    facade.trainExpiredSessions();
    ...
    facade.close();
    {code}

    * Optimizations. On my MacBook it now takes five minutes for the big unit test to process 3,500,000 queries: training the dictionary and extracting an a priori corpus by inverting the dictionary of the phrases and words people have the most trouble spelling.

    * Depends on LUCENE-550 by default again. When it took 30 seconds to execute 100,000 span near queries in a RAMDirectory and less than one second to do the same with an InstantiatedIndex, it simply did not make sense to use RAMDirectory as the default. Replacing one line of code removes the dependency on InstantiatedIndex.

    * New algorithmic second level suggester for queries containing terms not close enough in the text to be found in the a priori corpus. It is added with the lowest priority and checks against the system index rather than the a priori index. The second level suggester classes will soon need a bit of refactoring.
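
    The consumer code in this message records each query with its hit count in a session and later trains on expired sessions. As a rough illustration of how session history could become suggestions, here is a minimal in-memory sketch. It assumes the simplest possible rule: a query with zero hits votes for the first later query in the same session that returned hits. The names SessionTrainingSketch and QueryEvent are hypothetical; this is not the patch's training algorithm, which also weighs goals, accepted suggestions and more.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch of learning "did you mean" pairs from user query sessions. */
class SessionTrainingSketch {

    /** One query placed in a session, with the number of hits it returned. */
    static final class QueryEvent {
        final String query;
        final int hits;

        QueryEvent(String query, int hits) {
            this.query = query;
            this.hits = hits;
        }
    }

    /** Failed query -> (later successful query -> number of sessions voting for it). */
    private final Map<String, Map<String, Integer>> votes = new HashMap<>();

    /** Trains on one chronologically ordered session. */
    void train(List<QueryEvent> session) {
        for (int i = 0; i < session.size(); i++) {
            QueryEvent failed = session.get(i);
            if (failed.hits > 0) {
                continue; // only queries without hits look for a correction
            }
            for (int j = i + 1; j < session.size(); j++) {
                QueryEvent later = session.get(j);
                if (later.hits > 0) {
                    votes.computeIfAbsent(failed.query, k -> new HashMap<>())
                         .merge(later.query, 1, Integer::sum);
                    break; // vote for the first successful follow-up only
                }
            }
        }
    }

    /** Returns the follow-up query with the most votes, or null if nothing was learned. */
    String didYouMean(String query) {
        Map<String, Integer> candidates = votes.get(query);
        if (candidates == null) {
            return null;
        }
        String best = null;
        int bestVotes = 0;
        for (Map.Entry<String, Integer> entry : candidates.entrySet()) {
            if (entry.getValue() > bestVotes) {
                best = entry.getKey();
                bestVotes = entry.getValue();
            }
        }
        return best;
    }

    public static void main(String[] args) {
        SessionTrainingSketch sketch = new SessionTrainingSketch();
        List<QueryEvent> session = new ArrayList<>();
        session.add(new QueryEvent("heros of mght and magik", 0));
        session.add(new QueryEvent("heroes of might and magic", 120));
        sketch.train(session);
        System.out.println(sketch.didYouMean("heros of mght and magik")); // heroes of might and magic
    }
}
```

    Counting votes per session rather than trusting a single session is what lets a rule like this tolerate users who abandon a misspelled query for something unrelated.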



    Extended spell checker with phrase support and adaptive user session analysis.
    ------------------------------------------------------------------------------

    Key: LUCENE-626
    URL: https://issues.apache.org/jira/browse/LUCENE-626
    Project: Lucene - Java
    Issue Type: Improvement
    Components: Search
    Reporter: Karl Wettin
    Priority: Minor
    Attachments: LUCENE-626_20071023.txt


    Extensive javadocs available in patch, but I also try to keep it compiled here: http://ginandtonique.org/~kalle/javadocs/didyoumean/org/apache/lucene/search/didyoumean/package-summary.html#package_description
    A semi-retarded reinforcement learning thingy backed by algorithmic second level suggestion schemes that learns from and adapts to user behavior as queries change, suggestions are accepted or declined, etc.
    Besides detecting spelling errors, it considers context, composition/decomposition and a few other things.
    heroes of light and magik -> heroes of might and magic
    vinci da code -> da vinci code
    java docs -> javadocs
    blacksabbath -> black sabbath
    Depends on LUCENE-550
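
    Three of the examples above ("java docs", "blacksabbath", "vinci da code") involve no misspelled characters at all, only word boundaries and term order. One simple way to catch such cases is sketched below, assuming a dictionary of known phrases indexed under two normalized keys: one with all whitespace removed and one with the terms sorted. PhraseKeySketch and its methods are hypothetical names, and this key-based lookup is a swapped-in technique for illustration, not what the patch does.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

/** Hypothetical sketch: whitespace-insensitive and order-insensitive phrase lookup. */
class PhraseKeySketch {

    /** Known phrase indexed by its text with all whitespace removed. */
    private final Map<String, String> bySquashedKey = new HashMap<>();

    /** Known phrase indexed by its terms in sorted order. */
    private final Map<String, String> bySortedKey = new HashMap<>();

    void addKnownPhrase(String phrase) {
        bySquashedKey.put(squash(phrase), phrase);
        bySortedKey.put(sortTerms(phrase), phrase);
    }

    /** "black sabbath" and "blacksabbath" both become "blacksabbath". */
    static String squash(String phrase) {
        return phrase.replaceAll("\\s+", "");
    }

    /** "vinci da code" and "da vinci code" both become "code da vinci". */
    static String sortTerms(String phrase) {
        String[] terms = phrase.trim().split("\\s+");
        Arrays.sort(terms);
        return String.join(" ", terms);
    }

    /** Returns the known phrase matching the query, or null if none or identical. */
    String didYouMean(String query) {
        String match = bySquashedKey.get(squash(query));
        if (match == null) {
            match = bySortedKey.get(sortTerms(query));
        }
        return (match == null || match.equals(query)) ? null : match;
    }

    public static void main(String[] args) {
        PhraseKeySketch sketch = new PhraseKeySketch();
        sketch.addKnownPhrase("javadocs");
        sketch.addKnownPhrase("black sabbath");
        sketch.addKnownPhrase("da vinci code");
        System.out.println(sketch.didYouMean("java docs"));     // javadocs
        System.out.println(sketch.didYouMean("blacksabbath"));  // black sabbath
        System.out.println(sketch.didYouMean("vinci da code")); // da vinci code
    }
}
```

    The lookup is case sensitive, so queries would need the same normalization at training time and at suggestion time.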
  • Karl Wettin (JIRA) at Oct 23, 2007 at 8:20 pm
    [ https://issues.apache.org/jira/browse/LUCENE-626?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

    Karl Wettin updated LUCENE-626:
    -------------------------------

    Assignee: (was: Karl Wettin)
    Description:
    Extensive javadocs available in patch, but I also try to keep it compiled here: http://ginandtonique.org/~kalle/javadocs/didyoumean/org/apache/lucene/search/didyoumean/package-summary.html#package_description

    A semi-retarded reinforcement learning thingy backed by algorithmic second level suggestion schemes that learns from and adapts to user behavior as queries change, suggestions are accepted or declined, etc.

    Besides detecting spelling errors, it considers context, composition/decomposition and a few other things.

    heroes of light and magik -> heroes of might and magic
    vinci da code -> da vinci code
    java docs -> javadocs
    blacksabbath -> black sabbath

    Depends on LUCENE-550

  • Karl Wettin (JIRA) at Oct 23, 2007 at 8:21 pm
    [ https://issues.apache.org/jira/browse/LUCENE-626?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

    Karl Wettin updated LUCENE-626:
    -------------------------------

    Attachment: (was: LUCENE-626_2007_10_16.txt)
  • Karl Wettin (JIRA) at Apr 17, 2008 at 8:06 pm
    [ https://issues.apache.org/jira/browse/LUCENE-626?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12590189#action_12590189 ]

    Karl Wettin commented on LUCENE-626:
    ------------------------------------

    If anyone has some rather large query logs with session id, time stamp and preferably click-through data that I can test this on, that would be great. It really needs to be tuned against more than one log.
  • Mikkel Kamstrup Erlandsen (JIRA) at Jan 22, 2010 at 11:15 am
    [ https://issues.apache.org/jira/browse/LUCENE-626?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12803666#action_12803666 ]

    Mikkel Kamstrup Erlandsen commented on LUCENE-626:
    --------------------------------------------------

    FYI: I started working on updating Karl's patch. The code is not yet in a compilable state, but can be found on GitHub: http://github.com/mkamstrup/lucene-didyoumean

    My plans for this are to:

    * Port it to Lucene 3+ API
    * Abstract out the BDB dependency to allow for any old storage layer (BDB, JDBC, what have you)

    Thanks for the great work Karl!
  • Mikkel Kamstrup Erlandsen (JIRA) at Jan 25, 2010 at 2:48 pm
    [ https://issues.apache.org/jira/browse/LUCENE-626?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12804554#action_12804554 ]

    Mikkel Kamstrup Erlandsen commented on LUCENE-626:
    --------------------------------------------------

    @Karl: The test sources refer to a file, http://ginandtonique.org/~kalle/LUCENE-626/queries_grouped.txt.gz, which is no longer online. Is this resource still available somewhere?
  • Mikkel Kamstrup Erlandsen (JIRA) at Jan 25, 2010 at 2:54 pm
    [ https://issues.apache.org/jira/browse/LUCENE-626?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12804556#action_12804556 ]

    Mikkel Kamstrup Erlandsen commented on LUCENE-626:
    --------------------------------------------------

    Status update: All tests except two are passing since commit 7e4eb7989d81e50cc81b6f33ac5fa188467f5d3e on http://github.com/mkamstrup/lucene-didyoumean :

    1) TestTokenPhraseSuggester gives me an ArrayIndexOutOfBoundsException roughly halfway through the test cases (which otherwise pass)
    2) Missing the sample query log to import from the resource http://ginandtonique.org/~kalle/LUCENE-626/queries_grouped.txt.gz

    ! But be aware that this is still a work in progress !
  • Mikkel Kamstrup Erlandsen (JIRA) at Jan 26, 2010 at 12:53 pm
    [ https://issues.apache.org/jira/browse/LUCENE-626?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12805011#action_12805011 ]

    Mikkel Kamstrup Erlandsen commented on LUCENE-626:
    --------------------------------------------------

    Status update: All tests now pass as of the tag 'milestone3'.

    Missing essentials:
    * An on-disk backend for Dictionary and QuerySessionManager, either via JDBC or some Lucene magic
    * More large scale testing on said on-disk backends

    Missing nice-to-haves:
    * Code cleanup
    * More javadocs
    * Optimizations (there's a lot of low hanging fruit - we sprinkle objects and string copies all over the place!)
  • Karl Wettin (JIRA) at Jan 26, 2010 at 1:13 pm
    [ https://issues.apache.org/jira/browse/LUCENE-626?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12805021#action_12805021 ]

    Karl Wettin commented on LUCENE-626:
    ------------------------------------

    Hej Mikkel,

    The test case data set is on an HDD hidden away in an attic 600 km away from me, but I've asked someone in the vicinity to fetch it for me. Might take a little while. Sorry!

    However, it's extremely cool that you're working with this old beast! I'm super busy as always, but I promise to follow your progress in case there is something you wonder about. It's been a few years since I looked at the code, though.
  • Mikkel Kamstrup Erlandsen (JIRA) at Jan 26, 2010 at 7:59 pm
    [ https://issues.apache.org/jira/browse/LUCENE-626?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12805161#action_12805161 ]

    Mikkel Kamstrup Erlandsen commented on LUCENE-626:
    --------------------------------------------------

    Hej Karl,

    Super that you are still around! :-) Even sweeter that you are willing to try to turn up these files!

    Development is going pretty well, and I finally feel that I can wrap my head around it, but it is most valuable to have you as a backup!


