explore using automaton for fuzzyquery
--------------------------------------

Key: LUCENE-2089
URL: https://issues.apache.org/jira/browse/LUCENE-2089
Project: Lucene - Java
Issue Type: Wish
Components: Search
Reporter: Robert Muir
Assignee: Mark Miller
Priority: Minor


Mark brought this up on LUCENE-1606 (I will assign this to him; I know he is itching to write that nasty algorithm).

We can optimize FuzzyQuery by using AutomatonTermEnum. Here is my idea:
* Up front, calculate the maximum number of edits K required to match the user's supplied float threshold (a sketch of this derivation follows the list).
* For at least the common K values (1, 2, 3, etc.) we should use AutomatonTermEnum; outside of that, maybe use the existing slow logic. At high K it will seek too much to be helpful anyway.
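
For concreteness, here is a minimal sketch of the K derivation, assuming the scoring formula similarity = 1 - editDistance / termLength using the query term's length (Lucene's actual computation also factors in the candidate term's length), and a hypothetical MAX_PRECOMPUTED_K cutoff:

    // Sketch: largest K such that 1 - K/termLength still meets the threshold.
    static int maxRequiredEdits(float minSimilarity, int termLength) {
      if (minSimilarity >= 1.0f) {
        return 0; // exact match only
      }
      return (int) ((1.0 - minSimilarity) * termLength);
    }

    // Hypothetical dispatch: DFA enumeration for small K, linear scan otherwise.
    static final int MAX_PRECOMPUTED_K = 3;

    static boolean useAutomatonEnum(float minSimilarity, String term) {
      return maxRequiredEdits(minSimilarity, term.length()) <= MAX_PRECOMPUTED_K;
    }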

I modified my wildcard benchmark to generate random fuzzy queries.
* Pattern: 7N stands for NNNNNNN, etc.
* AvgMS_DFA: the time spent creating the automaton (constructor)

Pattern | Iter | AvgHits | AvgMS (old) | AvgMS (new, total) | AvgMS_DFA
7N      | 10   | 64.0    | 4155.9      | 38.6               | 20.3
14N     | 10   | 0.0     | 2511.6      | 46.0               | 37.9
28N     | 10   | 0.0     | 2506.3      | 93.0               | 86.6
56N     | 10   | 0.0     | 2524.5      | 304.4              | 298.5
As you can see, this prototype is no good yet, because it creates the DFA in a slow way: right now it builds an NFA, and all the wasted time is in the NFA->DFA conversion. So for a very long string it just gets worse and worse. This has nothing to do with Lucene, and as you can see the TermEnum itself is fast (AvgMS - AvgMS_DFA); there is no problem there.

Instead we should just build a DFA to begin with, maybe with this paper: http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.16.652
We can precompute the tables with that algorithm up to some reasonable K, and then I think we are OK.
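
For reference, the key trick in that paper is to reduce each input character to a small "characteristic vector" of matches against a window of the pattern, so the precomputed transition tables depend only on K and the vector, never on the pattern itself. A minimal sketch of that reduction (the parametric tables it would index into are the part we'd precompute):

    // Characteristic vector: bit j is set iff word[offset + j] == c.
    // For edit distance K the relevant window is about 2K + 1 characters,
    // so the vector fits comfortably in an int for small K.
    static int characteristicVector(char c, String word, int offset, int window) {
      int bits = 0;
      for (int j = 0; j < window; j++) {
        if (offset + j < word.length() && word.charAt(offset + j) == c) {
          bits |= 1 << j;
        }
      }
      return bits;
    }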

The paper references http://portal.acm.org/citation.cfm?id=135907 for linear minimization; whoever implements this should not worry about minimization.
In fact, at some point we need to determine whether AutomatonQuery should minimize FSMs at all, or whether it is enough for them to be deterministic with no transitions to dead states. (The only code that actually assumes a minimal DFA is the "Dumb" vs. "Smart" heuristic, and that can easily be rewritten as a summation; see the sketch below.) We need to benchmark really complex DFAs (i.e., write a regex benchmark) to figure out whether minimization is even helping right now.
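
One plausible reading of "rewritten as a summation": the heuristic only needs (a bound on) how many terms the automaton can match, and any deterministic, dead-state-free DFA yields that directly by summing path counts, minimal or not. A toy sketch over an array-based DFA (not Lucene's Automaton API; overflow is ignored):

    // Count strings of length <= maxLen accepted by a DFA.
    // next[q][c] is the target state for symbol c, or -1 if none.
    static long countAcceptedStrings(int[][] next, boolean[] accept,
                                     int initial, int maxLen) {
      int numStates = accept.length;
      long[] counts = new long[numStates]; // paths of the current length ending at q
      counts[initial] = 1;
      long total = accept[initial] ? 1 : 0; // the empty string
      for (int len = 1; len <= maxLen; len++) {
        long[] step = new long[numStates];
        for (int q = 0; q < numStates; q++) {
          if (counts[q] == 0) continue;
          for (int target : next[q]) {
            if (target >= 0) step[target] += counts[q];
          }
        }
        counts = step;
        for (int q = 0; q < numStates; q++) {
          if (accept[q]) total += counts[q];
        }
      }
      return total;
    }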




Discussion
----------

  • Mark Miller (JIRA) at Nov 22, 2009 at 3:11 pm
    [ https://issues.apache.org/jira/browse/LUCENE-2089?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12781136#action_12781136 ]

    Mark Miller commented on LUCENE-2089:
    -------------------------------------

    bq. (i will assign this to him, I know he is itching to write that nasty algorithm
    ha - too much wine last night to laugh that hard this morning. Painful.
  • Robert Muir (JIRA) at Nov 22, 2009 at 3:13 pm
    [ https://issues.apache.org/jira/browse/LUCENE-2089?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

    Robert Muir updated LUCENE-2089:
    --------------------------------

    Modified the description for readability: K is the number of edits; N refers to the length of the word.
  • Robert Muir (JIRA) at Nov 22, 2009 at 3:13 pm
    [ https://issues.apache.org/jira/browse/LUCENE-2089?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

    Robert Muir updated LUCENE-2089:
    --------------------------------

    (Second description edit: fixed the remaining "At high N" to read "At high K".)
  • Uwe Schindler (JIRA) at Nov 22, 2009 at 3:15 pm
    [ https://issues.apache.org/jira/browse/LUCENE-2089?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12781138#action_12781138 ]

    Uwe Schindler commented on LUCENE-2089:
    ---------------------------------------

    bq. ha - too much wine last night to laugh that hard this morning. Painful.

    I need more beer, after that 3.0.0 problem with the AttributeSource API :-( (LUCENE-2088).
  • Robert Muir (JIRA) at Nov 22, 2009 at 3:23 pm
    [ https://issues.apache.org/jira/browse/LUCENE-2089?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12781139#action_12781139 ]

    Robert Muir commented on LUCENE-2089:
    -------------------------------------

    By the way, the only open implementation of this algorithm I could find is at http://rrette.com/moman.html (ZSpell), in Python.
    The download link for 0.2 is not available, and from the website it appears that is the version with the updated algorithm.

  • Robert Muir (JIRA) at Nov 22, 2009 at 3:28 pm
    [ https://issues.apache.org/jira/browse/LUCENE-2089?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12781140#action_12781140 ]

    Robert Muir commented on LUCENE-2089:
    -------------------------------------

    I hope it's obvious from the benchmark why we shouldn't use the crappy prototype, and why we should instead optimize K=1 (which is probably very common).
    If someone has a small index with very long terms (bio sequences, or something), fuzzy would actually get slower.

  • Mark Miller (JIRA) at Nov 22, 2009 at 3:34 pm
    [ https://issues.apache.org/jira/browse/LUCENE-2089?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12781142#action_12781142 ]

    Mark Miller commented on LUCENE-2089:
    -------------------------------------

    I'll take a look anyway - too bad I can't find my stupid automata textbook - I've already bought the stupid thing twice in my life. Python translation sounds easier anyway :)
  • Robert Muir (JIRA) at Nov 22, 2009 at 3:48 pm
    [ https://issues.apache.org/jira/browse/LUCENE-2089?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12781144#action_12781144 ]

    Robert Muir commented on LUCENE-2089:
    -------------------------------------

    bq. I'll take a look anyway - too bad I can't find my stupid automata text book - already bought the stupid thing twice in my life. Python translation sounds easier anyway

    You will need more wine. If it starts to catch, I'll be happy to offer help with the automaton API, etc.

  • Mark Miller (JIRA) at Nov 22, 2009 at 4:22 pm
    [ https://issues.apache.org/jira/browse/LUCENE-2089?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12781149#action_12781149 ]

    Mark Miller commented on LUCENE-2089:
    -------------------------------------

    bq. we can precompute the tables with that algorithm up to some reasonable K, and then I think we are ok.

    I wonder what the best way to handle this would be - these transition tables appear to get very large very fast. It looks like the Python code is now calculating them on the fly. You wouldn't want to load them all into RAM for those not using them. Perhaps some sort of LRU cache with load on demand - though you can't really fully warm up that way.
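
    A load-on-demand cache along those lines is cheap to sketch. Assuming a hypothetical ParametricTable that holds the precomputed transition tables for one K, an access-ordered LinkedHashMap gives LRU eviction for free:

        import java.util.LinkedHashMap;
        import java.util.Map;

        /** Hypothetical holder for the precomputed tables for one K. */
        class ParametricTable {
          ParametricTable(int k) { /* build transition tables for edit distance k */ }
        }

        /** LRU cache: evicts the least-recently-used table past maxEntries. */
        class TableCache extends LinkedHashMap<Integer, ParametricTable> {
          private final int maxEntries;
          TableCache(int maxEntries) {
            super(16, 0.75f, true); // accessOrder = true gives LRU iteration order
            this.maxEntries = maxEntries;
          }
          @Override
          protected boolean removeEldestEntry(Map.Entry<Integer, ParametricTable> eldest) {
            return size() > maxEntries;
          }
        }

        // usage: cache.computeIfAbsent(k, ParametricTable::new) loads on first use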
  • Robert Muir (JIRA) at Nov 22, 2009 at 4:24 pm
    [ https://issues.apache.org/jira/browse/LUCENE-2089?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12781151#action_12781151 ]

    Robert Muir commented on LUCENE-2089:
    -------------------------------------

    Mark, they would get large fast, but I think we only need, say, K = 1, 2, 3.
    Once you go to high K, it will be many disk seeks anyway, and the 'linear' mode it uses now will probably be faster.

  • Robert Muir (JIRA) at Nov 22, 2009 at 4:30 pm
    [ https://issues.apache.org/jira/browse/LUCENE-2089?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12781154#action_12781154 ]

    Robert Muir commented on LUCENE-2089:
    -------------------------------------

    Another twist is that we have to support the 'constant prefix' as well.
    The easiest way would be: if the user asks for this, we intersect the automaton with a 'prefix automaton' such as abc.*

    But BasicOperations.intersection is quadratic in the number of states, because it supports both NFA and DFA.
    We might need to implement a better DFA-only intersection algorithm for this constant prefix, so that supplying a constant prefix doesn't actually make things slower :)
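
    For the prefix case specifically, a DFA-only intersection with prefix.* can be done in linear time: walk the prefix through the DFA once, then splice a fresh chain of prefix states in front of the state you land on. A toy sketch over an array-based DFA (Lucene's Automaton API differs; ALPHABET is a hypothetical alphabet size, and unreachable leftover states are not pruned):

        static final int ALPHABET = 128; // hypothetical alphabet size

        // Minimal array-based DFA: next[q][c] = target state or -1; accept[q].
        static class Dfa {
          int initial;
          boolean[] accept;
          int[][] next;
          Dfa(int numStates) {
            accept = new boolean[numStates];
            next = new int[numStates][ALPHABET];
            for (int[] row : next) java.util.Arrays.fill(row, -1);
          }
        }

        // Intersect dfa with prefix.* without the general quadratic product.
        static Dfa intersectWithPrefix(Dfa dfa, String prefix) {
          int k = prefix.length();
          if (k == 0) return dfa;
          int s = dfa.initial; // walk the prefix through the original DFA
          for (int i = 0; i < k; i++) {
            s = dfa.next[s][prefix.charAt(i)];
            if (s < 0) return new Dfa(1); // prefix not viable: empty language
          }
          int n = dfa.accept.length;
          Dfa out = new Dfa(k + n); // chain states 0..k-1, originals shifted by k
          out.initial = 0;
          for (int i = 0; i < k - 1; i++) out.next[i][prefix.charAt(i)] = i + 1;
          out.next[k - 1][prefix.charAt(k - 1)] = k + s; // hand off into the DFA
          for (int q = 0; q < n; q++) {
            out.accept[k + q] = dfa.accept[q];
            for (int c = 0; c < ALPHABET; c++) {
              if (dfa.next[q][c] >= 0) out.next[k + q][c] = k + dfa.next[q][c];
            }
          }
          return out;
        }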

  • Mark Miller (JIRA) at Nov 22, 2009 at 4:42 pm
    [ https://issues.apache.org/jira/browse/LUCENE-2089?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12781157#action_12781157 ]

    Mark Miller commented on LUCENE-2089:
    -------------------------------------

    bq. Mark, they would get large fast, but i think we only need say 1,2,3.

    Ah, right - you mention that above in the summary.

    bq. Another twist, is that we have to support the 'constant prefix' as well.

    Generally, if you have any kind of length to the prefix, linear wouldn't be so bad either, right? I guess it depends on your terms, though.
  • Robert Muir (JIRA) at Nov 22, 2009 at 4:46 pm
    [ https://issues.apache.org/jira/browse/LUCENE-2089?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12781161#action_12781161 ]

    Robert Muir commented on LUCENE-2089:
    -------------------------------------

    bq. Generally, if you have any kind of length to the prefix, linear wouldn't be so bad either though right? Guess it depends on your terms though.

    Not so bad, but not so good either. See my wildcard results, for example (on the 10M equally-distributed terms):

    Pattern | Iter | AvgHits | AvgMS (old) | AvgMS (new)
    N?N?N?N | 10   | 1000.0  | 288.6       | 38.5
  • Mark Miller (JIRA) at Nov 22, 2009 at 5:28 pm
    [ https://issues.apache.org/jira/browse/LUCENE-2089?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12781167#action_12781167 ]

    Mark Miller commented on LUCENE-2089:
    -------------------------------------

    Right, I wouldn't expect it to be great with a single-char prefix - especially with the random terms you're making. But I think that's a much worse case than a slightly longer prefix with a normal term distribution.
  • Robert Muir (JIRA) at Nov 22, 2009 at 5:36 pm
    [ https://issues.apache.org/jira/browse/LUCENE-2089?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12781170#action_12781170 ]

    Robert Muir commented on LUCENE-2089:
    -------------------------------------

    Mark maybe, though it also depends largely on the size of the "alphabet" in the term dictionary.

for my test, the alphabet is pretty small, [0-9]. With even smaller alphabets, say [GACD], the automaton enumeration becomes even more effective than any "prefix" mechanism

    so the performance is language-dependent.

  • Robert Muir (JIRA) at Nov 22, 2009 at 5:46 pm
    [ https://issues.apache.org/jira/browse/LUCENE-2089?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12781170#action_12781170 ]

    Robert Muir edited comment on LUCENE-2089 at 11/22/09 5:44 PM:
    ---------------------------------------------------------------

    Mark maybe, though it also depends largely on the size of the "alphabet" in the term dictionary.

    for my test, the alphabet is pretty small size, [0-9]. With smaller alphabets, say [GACD], the automaton enumeration becomes even more effective over any "prefix" mechanism

    so the performance is language-dependent.

also Mark, you said you want to "scale" fuzzy query, right? Taking a linear algorithm and dividing it by a constant (the alphabet size), which is what a constant prefix does, does not improve the scalability: for English, it just makes your fuzzy query 26 times faster. so imo it would be better to improve the real scalability... the constant prefix is just an optimization/hack but does not really solve the problem.


  • Mark Miller (JIRA) at Nov 22, 2009 at 5:54 pm
    [ https://issues.apache.org/jira/browse/LUCENE-2089?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12781173#action_12781173 ]

    Mark Miller commented on LUCENE-2089:
    -------------------------------------

bq. the constant prefix is just an optimization/hack

Right - which is part of why I am not so worried about handling it with a new impl - it's really only there now because the thing is so slow without it - if we really needed to support it, it wouldn't be so bad to fall back to it as it is. Though if we can support it with the new impl, fine - but I don't think I would go out of my way for it myself.

bq. for English, it just makes your fuzzy query 26 times faster

With a prefix of 1 again? Yeah - you really need to use a longer prefix to get much benefit - but I wouldn't think we care about a prefix of 1 with the new impl - just don't use a prefix. It's just a hack to somewhat get around the current scalability issues.

  • Robert Muir (JIRA) at Nov 22, 2009 at 5:58 pm
    [ https://issues.apache.org/jira/browse/LUCENE-2089?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12781174#action_12781174 ]

    Robert Muir commented on LUCENE-2089:
    -------------------------------------

bq. With a prefix of 1 again? Yeah - you really need to use a longer prefix to get much benefit - but I wouldn't think we care about a prefix of 1 with the new impl - just don't use a prefix. It's just a hack to somewhat get around the current scalability issues.

you write that crazy algorithm from the paper, and I will do the easy part (fast DFA intersection) so we can use the constant prefix.
we must "use" the prefix, so the results are the same as old FuzzyQuery for back compat - might as well do it right.

  • Robert Muir (JIRA) at Nov 22, 2009 at 6:02 pm
    [ https://issues.apache.org/jira/browse/LUCENE-2089?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12781174#action_12781174 ]

    Robert Muir edited comment on LUCENE-2089 at 11/22/09 6:01 PM:
    ---------------------------------------------------------------

bq. With a prefix of 1 again? Yeah - you really need to use a longer prefix to get much benefit - but I wouldn't think we care about a prefix of 1 with the new impl - just don't use a prefix. It's just a hack to somewhat get around the current scalability issues.

you write that crazy algorithm from the paper, and I will do the easy part (fast DFA intersection) so we can use the constant prefix.
we must "use" the prefix, so the results are the same as old FuzzyQuery for back compat - might as well do it right.

btw, I am implying here that we should never do a real Levenshtein dynamic-programming calculation if we do not have to.
for example, in my k=1 prototype, it is never used:
{code}
public float difference() {
  // exact match (or no current term): full score
  if (currentTerm == null || currentTerm.equals(term)) {
    return 1.0F;
  }
  // terms returned by the k=1 enum differ by exactly one edit,
  // so scale by the longer of the two term lengths
  return 1.0F - (1.0F / Math.max(currentTerm.text().length(), term.text().length()));
}
{code}

    instead, if up to k=3 is required, we should do these steps:
    * equals(), then 1.0F
    * matches k=1 automaton?
    * matches k=2 automaton?
    * matches k=3 automaton?

even for k=3 I am sure this is likely faster on average than doing Levenshtein, especially if the query matches many terms.
automaton matching is linear time, and just array access.
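To make the cascade concrete, here is a minimal sketch; k1/k2/k3 are hypothetical precomputed Levenshtein automata for the query term, and generalizing the k=1 score above to 1 - k/maxLen is an assumption, not committed behavior.

{code}
// Hypothetical sketch: k1/k2/k3 stand in for precomputed Levenshtein DFAs
// for the query term. Each matches() call is one linear pass with array
// lookups -- no dynamic-programming table is ever filled in.
interface TermMatcher { boolean matches(String s); }

class TieredDifference {
  static float difference(String candidate, String term,
                          TermMatcher k1, TermMatcher k2, TermMatcher k3) {
    float maxLen = Math.max(candidate.length(), term.length());
    if (candidate.equals(term)) return 1.0F;                 // 0 edits
    if (k1.matches(candidate))  return 1.0F - 1.0F / maxLen; // 1 edit
    if (k2.matches(candidate))  return 1.0F - 2.0F / maxLen; // 2 edits
    if (k3.matches(candidate))  return 1.0F - 3.0F / maxLen; // 3 edits
    return 0.0F; // unreachable if the enum only returns terms within k=3
  }
}
{code}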



  • Mark Miller (JIRA) at Nov 22, 2009 at 6:04 pm
    [ https://issues.apache.org/jira/browse/LUCENE-2089?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12781175#action_12781175 ]

    Mark Miller commented on LUCENE-2089:
    -------------------------------------

    bq. we must "use" the prefix, so the results are the same as old FuzzyQuery for back compat

    I'm not so worried about that myself. The prefix on the old Fuzzy is just to get around scalability - we could just deprecate and create a new Fuzzy without it. Why carry along the hack when we fix the root problem?
  • Mark Miller (JIRA) at Nov 22, 2009 at 6:08 pm
    [ https://issues.apache.org/jira/browse/LUCENE-2089?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12781175#action_12781175 ]

    Mark Miller edited comment on LUCENE-2089 at 11/22/09 6:06 PM:
    ---------------------------------------------------------------

    bq. we must "use" the prefix, so the results are the same as old FuzzyQuery for back compat

    I'm not so worried about that myself. The prefix on the old Fuzzy is just to get around scalability - we could just deprecate and create a new Fuzzy without it. Why carry along the hack when we fix the root problem?

    *edit*

Basically, what I'm saying is the old FuzzyQuery is just garbage really - I've even seen an email where Doug talked about just dropping it. He didn't like the non-scalable queries. Even if this new impl didn't exactly match the results of the old, I'd have no problem just deprecating the old and saying we are starting over with a "real" fuzzyquery - the old one goes away in the next major release - anyone dying to stick with it can just pull the query into their own code.

    I think if we keep the prefix, it should be for a good reason in itself, rather than just back compat with the old crappy version.

  • Robert Muir (JIRA) at Nov 22, 2009 at 6:12 pm
    [ https://issues.apache.org/jira/browse/LUCENE-2089?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12781179#action_12781179 ]

    Robert Muir commented on LUCENE-2089:
    -------------------------------------

bq. Basically, what I'm saying is the old FuzzyQuery is just garbage really - I've even seen an email where Doug talked about just dropping it.

I agree with garbage, but we cannot completely replace it anyway. for example, what if someone supplies a term of length 54 and asks for a distance of 0.5?
we should not use this algorithm for nonsense like that; in that case I think they should just use the garbage algorithm.

    Here is a quote from that moman page:
    It means that in theory, you could ask for a Levenshtein distance of 27! Well, if you have a week ahead of you...

we shouldn't burn cycles creating useless tables that will be huge arrays, either in fuzzyquery or wherever. we can't compute all the way up to infinity; this is why I think something like 1,2,3 is "reasonable" - if it requires more edits than that, go with the old brute force algorithm?
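For reference, a quick sketch of the up-front edit budget implied here, assuming the usual scaling similarity = 1 - edits/length; the cutoff constant is illustrative. A 54-char term at threshold 0.5 comes out to 27 edits - exactly the moman example above.

{code}
// Illustrative only: map the user's float threshold to a maximum edit
// count, then decide whether building the automaton is worthwhile.
class EditBudget {
  static final int MAX_AUTOMATON_K = 3; // beyond this, fall back to brute force

  static int requiredEdits(String term, float minSimilarity) {
    // similarity = 1 - edits/length  =>  edits = (1 - similarity) * length
    return (int) ((1.0f - minSimilarity) * term.length());
  }

  static boolean useAutomaton(String term, float minSimilarity) {
    // e.g. a 54-char term at minSimilarity 0.5 needs 27 edits -> brute force
    return requiredEdits(term, minSimilarity) <= MAX_AUTOMATON_K;
  }
}
{code}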
  • Mark Miller (JIRA) at Nov 22, 2009 at 6:14 pm
    [ https://issues.apache.org/jira/browse/LUCENE-2089?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12781175#action_12781175 ]

    Mark Miller edited comment on LUCENE-2089 at 11/22/09 6:12 PM:
    ---------------------------------------------------------------

    bq. we must "use" the prefix, so the results are the same as old FuzzyQuery for back compat

    I'm not so worried about that myself. The prefix on the old Fuzzy is just to get around scalability - we could just deprecate and create a new Fuzzy without it. Why carry along the hack when we fix the root problem?

    *edit*

Basically, what I'm saying is the old FuzzyQuery is just garbage really - I've even seen an email where Doug talked about just dropping it. He didn't like the non-scalable queries. Even if this new impl didn't exactly match the results of the old, I'd have no problem just deprecating the old and saying we are starting over with a "real" fuzzyquery - the old one goes away in the next major release - anyone dying to stick with it can just pull the query into their own code.

    I think if we keep the prefix, it should be for a good reason in itself, rather than just back compat with the old crappy version.

    *edit*

Not that I can't see people wanting just the tail end to be fuzzy - in that case, fine - do your dirty work :) I just don't know how much of a real requirement that is compared to a prefix query. Perhaps it's more useful than I'd guess myself. I think we are quite a ways from getting this algorithm implemented anyway :) Though I am trying to track down the latest Python code.

  • Mark Miller (JIRA) at Nov 22, 2009 at 6:16 pm
    [ https://issues.apache.org/jira/browse/LUCENE-2089?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12781180#action_12781180 ]

    Mark Miller commented on LUCENE-2089:
    -------------------------------------

    bq. if it requires more edits than that, go with the old brute force algorithm?

Perhaps - but it should be an option. We don't want perf to just drop off a cliff (and that's an understatement on a large index) just because the input crossed some line. I'd almost prefer the input just doesn't work - in my mind FuzzyQuery just doesn't work on large indices. But far be it from me to take away the option ...

I wouldn't really like it if it were default automatic - n=3 takes a few milliseconds, then n=4 takes an hour. Whoops.
  • Robert Muir (JIRA) at Nov 22, 2009 at 6:20 pm
    [ https://issues.apache.org/jira/browse/LUCENE-2089?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12781182#action_12781182 ]

    Robert Muir commented on LUCENE-2089:
    -------------------------------------

bq. I wouldn't really like it if it were default automatic - n=3 takes a few milliseconds, then n=4 takes an hour. Whoops.

you are completely right, it should not drop off a cliff. though I think as n increases, it will get significantly worse, because of so much seeking.
we find the nice n where this is almost as bad as brute force, and cut her off there?

this is basically what I did for the enum already: if you have a wildcard with a leading * or a regex with .*, it's faster to linear scan than to seek around.
Uwe and I are referring to this as "dumb" mode versus "smart" mode. I'm going to add a constructor to the enum, so this can be specified rather than "auto" as well.
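Roughly what that constructor knob might look like; all names here are hypothetical sketches, not the committed API.

{code}
// Hypothetical sketch of the proposed constructor knob.
enum MatchMode {
  DUMB,  // linear scan: step through every term, accept/reject via the DFA
  SMART, // seek: compute the next possible matching term and seek to it
  AUTO   // keep the existing heuristic that picks between the two
}

class AutomatonTermEnumSketch {
  private final MatchMode mode;

  // the extra argument lets callers force DUMB or SMART instead of AUTO
  AutomatonTermEnumSketch(MatchMode mode) {
    this.mode = mode;
  }

  boolean useSeeking(double estimatedSeekCost, double linearScanCost) {
    switch (mode) {
      case DUMB:  return false;
      case SMART: return true;
      default:    return estimatedSeekCost < linearScanCost; // AUTO
    }
  }
}
{code}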

  • Mark Miller (JIRA) at Nov 22, 2009 at 7:59 pm
    [ https://issues.apache.org/jira/browse/LUCENE-2089?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12781204#action_12781204 ]

    Mark Miller commented on LUCENE-2089:
    -------------------------------------

    bq. we find the nice n where this is almost as bad as brute force, and cut her off there?

yeah, that sounds good if there is a nice transition at a relatively low n.

bq. this is basically what I did for the enum already: if you have a wildcard with a leading * or a regex with .*, it's faster to linear scan than to seek around.

Yeah, and I think this makes sense - but you will notice both the Lucene QueryParser and Solr don't allow a leading wildcard by default - the perf is generally bad enough on a large index that we assume you don't want to allow it and force users to "turn on" the option.
  • Robert Muir (JIRA) at Nov 22, 2009 at 8:03 pm
    [ https://issues.apache.org/jira/browse/LUCENE-2089?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12781205#action_12781205 ]

    Robert Muir commented on LUCENE-2089:
    -------------------------------------

bq. but you will notice both the Lucene QueryParser and Solr don't allow a leading wildcard by default

if we add this automaton stuff, I think we should reconsider this rule.
instead of disallowing a leading * or ? by default, just disallow a leading * by default.

I won't really argue for it though, because a wildcard query of "?????????????abc" is probably just as bad as "*abc".
  • Mark Miller (JIRA) at Nov 22, 2009 at 8:07 pm
    [ https://issues.apache.org/jira/browse/LUCENE-2089?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12781206#action_12781206 ]

    Mark Miller commented on LUCENE-2089:
    -------------------------------------

I think it makes sense to allow a leading ? - ?????????abc is probably rare enough that it's worth allowing by default. Or perhaps a separate knob to turn it on and leave leading * off.
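One way such a knob could look, sketched against the classic QueryParser - setAllowLeadingWildcard is real, but this subclass and its leading-? behavior are hypothetical.

{code}
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.index.Term;
import org.apache.lucene.queryParser.ParseException;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.WildcardQuery;

// Hypothetical sketch: permit a leading '?' while still rejecting a leading '*'.
class LenientWildcardQueryParser extends QueryParser {
  LenientWildcardQueryParser(String field, Analyzer analyzer) {
    super(field, analyzer);
    setAllowLeadingWildcard(false); // keep leading '*' disabled
  }

  @Override
  protected Query getWildcardQuery(String field, String termStr)
      throws ParseException {
    if (termStr.startsWith("?")) {
      // bypass the leading-wildcard check for '?' only
      return new WildcardQuery(new Term(field, termStr));
    }
    return super.getWildcardQuery(field, termStr); // leading '*' still rejected
  }
}
{code}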
  • Robert Muir (JIRA) at Nov 22, 2009 at 8:11 pm
    [ https://issues.apache.org/jira/browse/LUCENE-2089?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12781207#action_12781207 ]

    Robert Muir commented on LUCENE-2089:
    -------------------------------------

Mark, you are right.

plus, the QueryParser does not throw exceptions if you have a fuzzy query with no constant prefix, which is actually going to be worse than even a leading *!!!
  • Mark Miller (JIRA) at Nov 22, 2009 at 8:17 pm
    [ https://issues.apache.org/jira/browse/LUCENE-2089?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12781208#action_12781208 ]

    Mark Miller commented on LUCENE-2089:
    -------------------------------------

Solr doesn't even allow a constant prefix with fuzzy - it's broken, broken, broken :) DisMax to the rescue.
  • Mark Miller (JIRA) at Nov 22, 2009 at 9:46 pm
    [ https://issues.apache.org/jira/browse/LUCENE-2089?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

    Mark Miller updated LUCENE-2089:
    --------------------------------

    Attachment: Moman-0.1.tar.gz
From the Moman author:
Absolutely. Sorry for the missing links. I had some problems with my provider, which restored an old backup of my website.
The library is MIT licensed, so feel free to take anything you want. I didn't make the Subversion repository available since I was working on Daciuk's "Incremental construction of minimal acyclic finite-state automata", improving the memory footprint of the algorithm.
Since I want to be sure the new algorithm is right before publishing it, the repository isn't available. Anyway, here's the source code (attached). I must warn you that there are no comments at all, which is unfortunate, since I was contacted lately by many people who had the same kind of request as yours.

    If you need anything, just don't hesitate to ask.

  • Robert Muir (JIRA) at Nov 22, 2009 at 9:50 pm
    [ https://issues.apache.org/jira/browse/LUCENE-2089?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12781230#action_12781230 ]

    Robert Muir commented on LUCENE-2089:
    -------------------------------------

    Mark, from his page it seemed like 0.2 was the version with the generalized edit distance?

    {noformat}
    Moman 0.2 is out!
    (2005-07-29) This version add the possibility to use a Levenshtein distance greater than 1.
    Before, the transitions tables were static, now we build them.
    It means that in theory, you could ask for a Levenshtein distance of 27!
    Well, if you have a week ahead of you...
    {noformat}
  • Mark Miller (JIRA) at Nov 22, 2009 at 10:00 pm
    [ https://issues.apache.org/jira/browse/LUCENE-2089?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

    Mark Miller updated LUCENE-2089:
    --------------------------------

    Attachment: (was: Moman-0.1.tar.gz)
  • Mark Miller (JIRA) at Nov 22, 2009 at 10:00 pm
    [ https://issues.apache.org/jira/browse/LUCENE-2089?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

    Mark Miller updated LUCENE-2089:
    --------------------------------

    Attachment: Moman-0.2.1.tar.gz

    Sorry - I attached the wrong file.
  • Mark Miller (JIRA) at Nov 25, 2009 at 3:04 pm
    [ https://issues.apache.org/jira/browse/LUCENE-2089?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12782458#action_12782458 ]

    Mark Miller commented on LUCENE-2089:
    -------------------------------------

I might be on the trail of a Java impl - get out the hounds!

Perhaps I won't be looking over that paper for Thanksgiving after all.
  • Robert Muir (JIRA) at Dec 9, 2009 at 2:42 am
    [ https://issues.apache.org/jira/browse/LUCENE-2089?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

    Robert Muir updated LUCENE-2089:
    --------------------------------

    Attachment: TestFuzzy.java

Attached is a testcase that builds the n=1 automaton for some string the slow way.
You can use this to verify that some faster method is correct, via assertEquals.
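For anyone reproducing this, the slow-but-obviously-correct construction is conceptually something like the following (a self-contained sketch, not the attached TestFuzzy.java; it assumes a fixed alphabet): materialize every string within one edit, then assertEquals that set against whatever the faster method accepts.

{noformat}
import java.util.Set;
import java.util.TreeSet;

public class SlowNeighborhood {
  // All strings within Levenshtein distance 1 of s, over a fixed alphabet.
  static Set<String> within1(String s, char[] alphabet) {
    Set<String> out = new TreeSet<String>();
    out.add(s); // distance 0
    for (int i = 0; i < s.length(); i++) {
      out.add(s.substring(0, i) + s.substring(i + 1)); // deletion
    }
    for (int i = 0; i <= s.length(); i++) {
      for (char c : alphabet) {
        out.add(s.substring(0, i) + c + s.substring(i)); // insertion
        if (i < s.length()) {
          out.add(s.substring(0, i) + c + s.substring(i + 1)); // substitution
        }
      }
    }
    return out;
  }
}
{noformat}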

  • Uwe Schindler (JIRA) at Dec 9, 2009 at 9:14 am
    [ https://issues.apache.org/jira/browse/LUCENE-2089?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12787989#action_12787989 ]

    Uwe Schindler commented on LUCENE-2089:
    ---------------------------------------

Cool! The code looks quite simple (but maybe that is because of n=1). But FuzzyQuery with n>1 is seldom used, isn't it? And how slow is it?
  • Robert Muir (JIRA) at Dec 10, 2009 at 12:54 am
    [ https://issues.apache.org/jira/browse/LUCENE-2089?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

    Robert Muir updated LUCENE-2089:
    --------------------------------

    Description:
Mark brought this up on LUCENE-1606 (I will assign this to him, I know he is itching to write that nasty algorithm)

we can optimize FuzzyQuery by using AutomatonTermsEnum; here is my idea:
* up front, calculate the maximum number of edits K required to match the user's supplied float threshold (a sketch of this computation follows below).
* for at least the common N up to K (1, 2, 3, etc.) we should create a DFA for each N.

If the required K is above our supported DFA-based N, we use "dumb mode" at first (no seeking, no DFA, just brute force like now).
As the pq fills, we swap progressively lower DFAs into the enum, based upon the lowest score in the pq.
This should work well on average: at high N, you will typically fill the pq very quickly, since you will match many terms.
This not only provides a mechanism to switch to more efficient DFAs during enumeration, but also to switch from "dumb mode" to "smart mode".

I modified my wildcard benchmark to generate random fuzzy queries.
* Pattern: 7N stands for NNNNNNN, etc.
* AvgMS_DFA: the time spent creating the automaton (constructor)
Pattern||Iter||AvgHits||AvgMS(old)||AvgMS (new,total)||AvgMS_DFA||
7N|10|64.0|4155.9|38.6|20.3|
14N|10|0.0|2511.6|46.0|37.9|
28N|10|0.0|2506.3|93.0|86.6|
56N|10|0.0|2524.5|304.4|298.5|
As you can see, this prototype is no good yet, because it creates the DFA in a slow way: right now it builds an NFA, and all the wasted time is in the NFA->DFA conversion.
So, for a very long string, it just gets worse and worse. This has nothing to do with Lucene, and you can see here that the TermEnum itself is fast (AvgMS - AvgMS_DFA); there is no problem there.

Instead we should build a DFA to begin with, maybe with this paper: http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.16.652
We can precompute the tables with that algorithm up to some reasonable K, and then I think we are ok.

The paper references http://portal.acm.org/citation.cfm?id=135907 for linear minimization; whoever implements this should not worry about minimization.
In fact, at some point we need to determine whether AutomatonQuery should minimize FSMs at all, or whether it is enough for them to be deterministic with no transitions to dead states. (The only code that actually assumes a minimal DFA is the "Dumb" vs. "Smart" heuristic, and it can easily be rewritten as a summation.) We need to benchmark really complex DFAs (i.e., write a regex benchmark) to figure out whether minimization is even helping right now.
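A minimal sketch of that up-front K computation, assuming the classic FuzzyTermEnum scoring (similarity = 1 - distance / min(|query|, |term|)); under that formula, the bound is loosest when the min is the query's own length:

{noformat}
public class MaxEdits {
  // Assumption: similarity = 1 - distance / min(|query|, |term|), so the
  // largest distance that can still meet the threshold uses |query| as min.
  static int maxRequiredEdits(String queryTerm, float minSimilarity) {
    return (int) ((1.0f - minSimilarity) * queryTerm.length());
  }

  public static void main(String[] args) {
    // e.g. "NNNNNNN" (length 7) at the default 0.5 threshold gives K = 3
    System.out.println(maxRequiredEdits("NNNNNNN", 0.5f));
  }
}
{noformat}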



  • Robert Muir (JIRA) at Dec 10, 2009 at 12:58 am
    [ https://issues.apache.org/jira/browse/LUCENE-2089?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

    Robert Muir updated LUCENE-2089:
    --------------------------------

    Description:
Mark brought this up on LUCENE-1606 (I will assign this to him, I know he is itching to write that nasty algorithm)

we can optimize FuzzyQuery by using AutomatonTermsEnum; here is my idea:
* up front, calculate the maximum number of edits K required to match the user's supplied float threshold.
* for at least small common edit distances E up to some max K (1, 2, 3, etc.) we should create a DFA for each E.

If the required E is above our supported max, we use "dumb mode" at first (no seeking, no DFA, just brute force like now).
As the pq fills, we swap progressively lower DFAs into the enum, based upon the lowest score in the pq (see the control-flow sketch below).
This should work well on average: at high E, you will typically fill the pq very quickly, since you will match many terms.
This not only provides a mechanism to switch to more efficient DFAs during enumeration, but also to switch from "dumb mode" to "smart mode".

I modified my wildcard benchmark to generate random fuzzy queries.
* Pattern: 7N stands for NNNNNNN, etc.
* AvgMS_DFA: the time spent creating the automaton (constructor)
Pattern||Iter||AvgHits||AvgMS(old)||AvgMS (new,total)||AvgMS_DFA||
7N|10|64.0|4155.9|38.6|20.3|
14N|10|0.0|2511.6|46.0|37.9|
28N|10|0.0|2506.3|93.0|86.6|
56N|10|0.0|2524.5|304.4|298.5|
As you can see, this prototype is no good yet, because it creates the DFA in a slow way: right now it builds an NFA, and all the wasted time is in the NFA->DFA conversion.
So, for a very long string, it just gets worse and worse. This has nothing to do with Lucene, and you can see here that the TermEnum itself is fast (AvgMS - AvgMS_DFA); there is no problem there.

Instead we should build a DFA to begin with, maybe with this paper: http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.16.652
We can precompute the tables with that algorithm up to some reasonable K, and then I think we are ok.

The paper references http://portal.acm.org/citation.cfm?id=135907 for linear minimization; whoever implements this should not worry about minimization.
In fact, at some point we need to determine whether AutomatonQuery should minimize FSMs at all, or whether it is enough for them to be deterministic with no transitions to dead states. (The only code that actually assumes a minimal DFA is the "Dumb" vs. "Smart" heuristic, and it can easily be rewritten as a summation.) We need to benchmark really complex DFAs (i.e., write a regex benchmark) to figure out whether minimization is even helping right now.
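To make the "swap progressively lower DFAs" idea concrete, here is a hypothetical control-flow sketch; the Matcher interface, the dfaByEdits table, and the scoring stub are illustrative stand-ins, not Lucene API:

{noformat}
import java.util.PriorityQueue;

public class AdaptiveFuzzySketch {
  interface Matcher { boolean matches(String term); }

  static final Matcher ACCEPT_ALL = new Matcher() { // "dumb mode"
    public boolean matches(String term) { return true; }
  };

  static void collect(Iterable<String> terms, Matcher[] dfaByEdits,
                      int requiredK, int queryLen, int pqSize) {
    PriorityQueue<Float> pq = new PriorityQueue<Float>(); // head = worst kept score
    int maxSupported = dfaByEdits.length - 1;
    Matcher current = requiredK <= maxSupported ? dfaByEdits[requiredK] : ACCEPT_ALL;
    for (String term : terms) {
      if (!current.matches(term)) continue;
      pq.add(similarity(term, queryLen));
      if (pq.size() > pqSize) pq.poll(); // keep only the best pqSize scores
      if (pq.size() == pqSize) {
        // The worst kept score bounds the edits a competitive term may need:
        // under similarity = 1 - d / min(|query|, |term|), that is
        // floor((1 - worst) * |query|).
        int neededE = (int) ((1.0f - pq.peek()) * queryLen);
        if (neededE <= maxSupported && dfaByEdits[neededE] != current) {
          current = dfaByEdits[neededE]; // swap in a tighter, cheaper DFA
        }
      }
    }
  }

  static float similarity(String term, int queryLen) {
    return 0.5f; // placeholder scoring stub
  }
}
{noformat}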



  • Robert Muir (JIRA) at Dec 11, 2009 at 7:10 pm
    [ https://issues.apache.org/jira/browse/LUCENE-2089?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

    Robert Muir updated LUCENE-2089:
    --------------------------------

    Attachment: LUCENE-2089.patch

Here is a patch that implements the crazy algorithm.

I only include the tables for n=1, but the algorithm is general to all n, has tests, works, and is fast (one possible compact table encoding is sketched after the list below).

The next steps, imho, are:
* look at how much adding tables for higher n will help, by looking at some dictionary statistics (assuming a priority queue of 1024)
* add tables for those values of n
* add the priority queue attribute
* write the enum (probably adding a little to AutomatonTermsEnum so it can be extended to do this)
* maybe clean up the code for this part.
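On the table representation: one compact way a generated parametric table could be keyed in (purely illustrative; the patch's actual encoding may well differ) is flat arrays indexed by (parametric state, characteristic vector):

{noformat}
public class ParametricTableSketch {
  static final int VECTOR_BITS = 3; // hypothetical characteristic-vector width
  static final int NUM_STATES = 5;  // hypothetical parametric state count
  // (state, vector) -> next parametric state; a real n=1 table would be
  // generated and filled in (e.g. with -1 marking the dead state).
  static final short[] toState = new short[NUM_STATES << VECTOR_BITS];

  static int step(int state, int characteristicVector) {
    return toState[(state << VECTOR_BITS) | characteristicVector];
  }
}
{noformat}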

  • Uwe Schindler (JIRA) at Dec 13, 2009 at 12:31 pm
    [ https://issues.apache.org/jira/browse/LUCENE-2089?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12789885#action_12789885 ]

    Uwe Schindler commented on LUCENE-2089:
    ---------------------------------------

You wrote that some parts of the code are "automatically generated". Is there code available to do this for other n? Otherwise I would move the class to the util.automaton package, since e.g. the regex parser also lives there.

I assume that I can use the generated automaton with AutomatonQuery and it would hit all matching docs, but without scoring?
  • Robert Muir (JIRA) at Dec 13, 2009 at 3:19 pm
    [ https://issues.apache.org/jira/browse/LUCENE-2089?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12789900#action_12789900 ]

    Robert Muir commented on LUCENE-2089:
    -------------------------------------

bq. You wrote that some parts of the code are "automatically generated". Is there code available to do this for other n? Otherwise I would move the class to the util.automaton package, since e.g. the regex parser also lives there.

It is not so simple... as n increases, the amount of generated code gets very large, and the math to generate the code is very heavy.
The first 'table' I simply keyed in manually, so it would be nice if these 'tables' could be more compact... this is one reason I uploaded this code.

bq. I assume that I can use the generated automaton with AutomatonQuery and it would hit all matching docs, but without scoring?

Yeah, for n=1 with this code. We could also 'key in' the code for n=2, but it's a lot more, and it would be better to figure out the smallest representation in Java and make a code generator.
  • Earwin Burrfoot (JIRA) at Dec 14, 2009 at 8:48 pm
    [ https://issues.apache.org/jira/browse/LUCENE-2089?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12790358#action_12790358 ]

    Earwin Burrfoot commented on LUCENE-2089:
    -----------------------------------------

bq. By the way, the only open impl of this algorithm I could find is at http://rrette.com/moman.html (ZSpell), in Python.
I recently stumbled upon a C++/STL-style impl -> http://code.google.com/p/patl/

bq. I might be on the trail of a java impl - get out the hounds!
If you do take hold of it, do not hesitate to share :) The original paper and C++ code likewise melt my brain, and I need the algorithm in another place as well.
  • Mark Miller (JIRA) at Dec 14, 2009 at 9:05 pm
    [ https://issues.apache.org/jira/browse/LUCENE-2089?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12790368#action_12790368 ]

    Mark Miller commented on LUCENE-2089:
    -------------------------------------

bq. If you do take hold of it, do not hesitate to share :) The original paper and C++ code likewise melt my brain, and I need the algorithm in another place as well.

The Java impl I was onto was about 75% complete according to the author, but I have not yet looked at the code. Last I heard, though, Robert was convinced it implements a different, less efficient algorithm.

We have cracked much of the paper - that is how Robert implemented n=1 here; it comes straight from the paper. The next step is to work out how to construct the tables for arbitrary n, as Robert says above, and to store those tables efficiently, since they start getting quite large rather fast - though we might only use up to n=3 or 4 in Lucene; Robert suspects term seeking will outweigh any gains beyond that.

I think we know how to do the majority of the work for the general-n case, but I don't really have much (or any) time for this, so it probably depends on if and when Robert gets to it. If he loses interest in finishing, I definitely plan to come back to it someday; in either case I'd like to complete my understanding of the paper and see a full general-n Java impl of this.

The main piece left that I don't fully understand - computing all possible states for a given n - can be done with just a brute-force check (that is how the Python impl does it; see the position/subsumption sketch below), so there may not be much more to understand. I would like to know how the paper gets its 'i'-parametrized state generators, though - that is much more efficient. The paper shows them only for n=1 and n=2.
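For anyone following along with the paper, a hedged gloss: a 'position' is a pair i#e (offset i into the word, e edits already consumed), and one position subsumes another when everything the second can still accept, the first can too. Below is a minimal sketch of that predicate, assuming the standard Schulz/Mihov definition (i#e subsumes j#f iff e < f and |j - i| <= f - e); the class is illustrative, not code from the patch.

{code:java}
/** A position i#e in a Levenshtein automaton: offset i, e edits consumed. */
record Position(int offset, int edits) {
  /**
   * i#e subsumes j#f iff e < f and |j - i| <= f - e: anything still
   * reachable from j#f within the remaining edit budget is also reachable
   * from i#e, so j#f is redundant in a state that already contains i#e.
   */
  boolean subsumes(Position other) {
    return edits < other.edits
        && Math.abs(other.offset - offset) <= other.edits - edits;
  }
}
{code}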
  • Robert Muir (JIRA) at Dec 14, 2009 at 9:53 pm
    [ https://issues.apache.org/jira/browse/LUCENE-2089?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12790392#action_12790392 ]

    Robert Muir commented on LUCENE-2089:
    -------------------------------------

Thanks for the pointer, Earwin - it is very useful to have something else to look at.

Mark, I am definitely not losing interest in finishing; mostly I am recovering from the brain damage that paper caused me.

I will say I ran stats on various dictionaries, and I think the following:
* I think efficiently supporting the current fuzzy defaults (0.5 similarity with 1024 n-best expansions) is very challenging. The pq size (1024 n-best expansions) is actually the big kicker here - it seems a little absurd to me. If it were smaller, we could support lower n and not worry about the 0.5 similarity so much, because we would fill up the pq quicker and n would drop fast. But I think adjusting both defaults might make sense (a sketch of how the edit budget falls out of the similarity threshold follows below).
* I think we should try to find principled ways of figuring out what good defaults should be, and this should not be based on performance alone (e.g. an OCR-damaged test collection, something like that).

So anyway, I'm really happy with where this is right now, because if you look at the performance numbers at the top, we were spending 300ms building DFAs for 56-character words, and now it is 5ms. We just need to take the next step.
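To make the interaction between the similarity threshold and term length concrete, here is a minimal worked sketch assuming the classic fuzzy scoring definition similarity = 1 - d / min(|query|, |term|), with matches requiring similarity strictly above the threshold; the helper name is hypothetical.

{code:java}
/**
 * Largest edit distance K that can still beat a similarity threshold,
 * assuming similarity = 1 - d / min(|query|, |term|) and a strict
 * comparison. Example: a 10-char query at 0.5 similarity needs
 * 1 - d/10 > 0.5, i.e. d < 5, so K = 4.
 */
static int maxRequiredEdits(int queryLength, float minSimilarity) {
  // d must be strictly less than (1 - minSimilarity) * queryLength
  return (int) Math.ceil((1.0 - minSimilarity) * queryLength) - 1;
}
{code}

Under these assumptions, K grows linearly with the query length, which is why a fixed small n only covers short terms at the 0.5 default, and why filling the pq quickly (so the effective threshold rises) matters so much.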


  • Earwin Burrfoot (JIRA) at Dec 15, 2009 at 6:29 am
    [ https://issues.apache.org/jira/browse/LUCENE-2089?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12790586#action_12790586 ]

    Earwin Burrfoot commented on LUCENE-2089:
    -----------------------------------------

    bq. I would like to know how the paper is getting 'i' parametrized state generators though - thats much more efficient.
Are you referring to the process of generating the Levenshtein automata, or to chapter 6, where they skip generating the automaton and compute its states on the fly instead?

  • Robert Muir (JIRA) at Dec 15, 2009 at 12:05 pm
    [ https://issues.apache.org/jira/browse/LUCENE-2089?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12790712#action_12790712 ]

    Robert Muir commented on LUCENE-2089:
    -------------------------------------

    bq. Are you referring to the process of generating lev-automata, or to chapter 6, where they skip generating automata and compute its states on the go instead?

No, he is referring to generating the parametric 'tables' for arbitrary n.
  • Mark Miller (JIRA) at Dec 15, 2009 at 2:15 pm
    [ https://issues.apache.org/jira/browse/LUCENE-2089?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12790748#action_12790748 ]

    Mark Miller commented on LUCENE-2089:
    -------------------------------------

Sorry Earwin - to be clear, we don't actually use chapter 6; AutomatonQuery needs the automata.

You can get all the states by taking the power set of the subsumption triangle for every base position, and then removing from each set any position that is subsumed by another - that is what I mean by brute force (see the sketch below). But in the paper they boil this down to nice little i-parametrized 'tables', extracting some sort of pattern from that process. They give no hint of how they do this, or whether it is applicable to larger n. No big deal, I guess - the computer can do the brute-force method (that is how the Python impl does it) - but I wouldn't be surprised if it starts to bog down at much higher n.
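A minimal sketch of that brute-force enumeration, reusing the Position.subsumes predicate sketched earlier; the triangle bounds and names are illustrative assumptions, not code from the patch or the Python impl.

{code:java}
import java.util.*;

class StateEnumerator {
  /**
   * Brute-force candidate parametric states for edit distance n: build the
   * subsumption triangle of positions around a base offset 0 (offsets
   * -e..+e at each edit level e <= n), enumerate its subsets, and reduce
   * each subset by dropping positions subsumed by another member.
   */
  static Set<Set<Position>> enumerateStates(int n) {
    List<Position> triangle = new ArrayList<>();
    for (int e = 0; e <= n; e++) {
      for (int i = -e; i <= e; i++) {
        triangle.add(new Position(i, e));
      }
    }
    Set<Set<Position>> states = new HashSet<>();
    // Power set via bitmask: (n+1)^2 positions, so this is only feasible
    // for small n - exactly the "bogs down at higher n" concern above.
    for (long mask = 1; mask < (1L << triangle.size()); mask++) {
      Set<Position> subset = new HashSet<>();
      for (int bit = 0; bit < triangle.size(); bit++) {
        if ((mask & (1L << bit)) != 0) {
          subset.add(triangle.get(bit));
        }
      }
      // Keep only positions not subsumed by another member of the set.
      Set<Position> reduced = new HashSet<>();
      for (Position p : subset) {
        boolean subsumed = false;
        for (Position q : subset) {
          if (!q.equals(p) && q.subsumes(p)) { subsumed = true; break; }
        }
        if (!subsumed) reduced.add(p);
      }
      states.add(reduced);
    }
    return states;
  }
}
{code}

The distinct reduced sets are the candidate parametric states; the paper's 'i'-parametrized generators presumably arrive at the same sets without walking the whole power set.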
  • Robert Muir (JIRA) at Dec 15, 2009 at 2:25 pm
    [ https://issues.apache.org/jira/browse/LUCENE-2089?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12790756#action_12790756 ]

    Robert Muir commented on LUCENE-2089:
    -------------------------------------

bq. You can get all the states by taking the power set of the subsumption triangle for every base position, and then removing from each set any position that is subsumed by another. That is what I mean by brute force. But in the paper, they boil this down to nice little i-parametrized 'tables', extracting some sort of pattern from that process. They give no hint of how they do this, or whether it is applicable to larger n.

These parameterized states are based on the minimal boundary, and it helps me a bit to start thinking in those terms (instead of E_2, think of 2#1, 3#1, 4#1). Then the parameterized tables for n=2 make a little more sense... (I'm guessing you already know this, Mark - just in case.)
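For readers without the paper at hand, a hedged gloss of the notation as I understand it (not the paper's exact wording): a concrete state is a set of positions i#e, and any two states that differ only by a uniform shift of all offsets are instances of one parametric state, anchored at its minimal boundary. For example:

{i#1, (i+1)#1, (i+2)#1}  =  parametric state {0#1, 1#1, 2#1} at boundary i

which is why reading 2#1, 3#1, 4#1 as offsets relative to a common base makes the n=2 tables easier to digest.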
  • Fuad Efendi (JIRA) at Feb 10, 2010 at 3:50 pm
    [ https://issues.apache.org/jira/browse/LUCENE-2089?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12832033#action_12832033 ]

    Fuad Efendi commented on LUCENE-2089:
    -------------------------------------

    Downloadable article (PDF):
    http://www.mitpressjournals.org/doi/pdf/10.1162/0891201042544938?cookieSet=1

